Abstract

We study the problem of constructing approximations to a weighted automaton. Weighted finite
automata (WFA) are closely related to the theory of rational series. A rational series is a function from
strings to real numbers that can be computed by a WFA. Among others, this includes probability
distributions generated by hidden Markov models and probabilistic automata. The relationship between
rational series and WFA is analogous to the relationship between regular languages and ordinary
automata. Associated with such rational series are infinite matrices called Hankel matrices, which play a
fundamental role in the theory of minimal WFA. Our contributions are: (1) an effective procedure for
computing the singular value decomposition (SVD) of such infinite Hankel matrices based on their finite
representation in terms of WFA; (2) a new canonical form for WFA based on this SVD;
and (3) an algorithm to construct approximate minimizations of a given WFA. The goal of our
approximate minimization algorithm is to start from a minimal WFA and produce a smaller WFA that is close
to the given one in a certain sense. The desired size of the approximating automaton is given as input.
We give bounds describing how well the approximation emulates the behavior of the original WFA.

The study of this problem is motivated by the analysis of machine learning algorithms that synthesize
weighted automata from spectral decompositions of finite Hankel matrices. It is known that when the 
number of states of the target automaton is correctly guessed, these algorithms enjoy consistency and 
finite-sample guarantees in the probably approximately correct (PAC) learning model. It has also been 
suggested that asking the learning algorithm to produce a model smaller than the true one will still 
yield useful models with reduced complexity. Our results in this paper vindicate these ideas and confirm 
intuitions provided by empirical studies. Beyond learning problems, our techniques can also be used to 
reduce the complexity of any algorithm working with WFA, at the expense of incurring a small, controlled 
amount of error. 


1 Introduction 

We address a relatively new issue for the logic and computation community: the approximate minimization 
of transition systems or automata. This concept is appropriate for systems that are quantitative in some 
sense: weighted automata, probabilistic automata of various kinds and timed automata. This paper focusses 
on weighted automata where we are able to make a number of contributions that combine ideas from duality 
with ideas from the theory of linear operators and their spectrum. Our new contributions are:

• An algorithm for computing the SVD of infinite Hankel matrices based on their representation in
terms of weighted automata. 

• A new canonical form for weighted automata arising from the SVD of its corresponding Hankel matrix. 

• An algorithm to construct approximate minimizations of given weighted automata by truncating the
canonical form. 




Minimization of automata has been a major subject since the 1950s, starting with the now classical work
of the pioneers of automata theory. Recently there has been activity on novel algorithms for minimization
based on duality [?], which are ultimately based on a remarkable algorithm due to Brzozowski from the
1960s [19]. The general co-algebraic framework permits one to generalize Brzozowski's algorithm to other
classes of automata like weighted automata.

Weighted automata are very useful in a variety of practical settings, such as machine learning (where
they are used to represent predictive models for time series data and text), but also in the general theory of
quantitative systems. There has also been interest in this type of representation, for example, in concurrency
theory [18] and in semantics [?]. We discuss the machine learning motivations at greater length, as they are
the main driver for the present work. However, we emphasize that the genesis of one set of key ideas came
from previous work on a coalgebraic view of minimization.

Spectral techniques for learning latent variable models have recently drawn a lot of attention in the
machine learning community. Following the significant milestone papers [?], in which an efficient spectral
algorithm for learning hidden Markov models (HMM) and stochastic rational languages was given, the field
has grown very rapidly. The original algorithm, which is based on singular value decompositions of Hankel
matrices, has been extended to reduced-rank HMM [35], predictive state representations (PSR) [?],
finite-state transducers [?], and many other classes of functions on strings [?]. Although each of these
papers works with slightly different problems and analysis techniques, the key ingredient always turns out to be
the same: parametrize the target model as a weighted finite automaton (WFA) and learn this WFA
from the SVD of a finite sub-block of its Hankel matrix [5]. Therefore, it is possible (and desirable) to
study all these learning algorithms from the point of view of rational series, which are exactly the class of
real-valued functions on strings that can be computed by WFA. In addition to their use in spectral learning
algorithms, weighted automata are also commonly used in other areas of pattern recognition for sequences,
including: speech recognition [32], image compression [?], natural language processing [28], model checking
[2], and machine translation [22].

Part of the appeal of spectral learning techniques comes from their computational superiority when
compared to iterative algorithms like Expectation-Maximization (EM) [23]. Another very attractive property
of spectral methods is the possibility of proving rigorous statistical guarantees about the learned hypothesis.
For example, under a realizability assumption, these methods are known to be consistent and amenable to
finite-sample analysis in the PAC sense [26]. An important detail is that, in addition to realizability, these
results work under the assumption that the user correctly guesses the number of latent states of the target
distribution. Though this is not a real caveat when it comes to using these algorithms in practice - the
optimal number of states can be identified using a model selection procedure [9] - it is one of the barriers in
extending the statistical analysis of spectral methods to the non-realizable setting.

Tackling the non-realizability question requires, as a special case, dealing with the situation in which
data is generated from a WFA with $n$ states and the learning algorithm is asked to produce a WFA with
$\hat{n} < n$ states. This case is already a non-trivial problem which - barring the noisiness introduced by the use
of statistical data instead of the original WFA - can be easily interpreted as an approximate minimization
of WFA. From this point of view, the possibility of using spectral learning algorithms for approximate
minimization of a small class of hidden Markov models has been recently considered in [30]. This paper
also presents some restricted theoretical results bounding the error between the original and minimized
HMM in terms of the total variation distance. Though incomparable to ours, these bounds are the closest
work in the literature to our approach.¹ Another paper in which the issue of approximate minimization of
weighted automata is considered in a tangential manner is [27]. In this case the authors again focus on an
$\ell^1$-like accuracy measure to compare two automata: an original one, and another one obtained by removing
transitions with small weights occurring during an exact minimization procedure. Though the removal
operation is introduced as a means of obtaining a numerically stable minimization algorithm, the paper
also presents some experiments exploring the effect of removing transitions with larger weights. With the
exception of these timid results, the problem of approximate minimization remains largely unstudied. In the
present paper we set out to initiate the systematic study of approximate minimization of WFA. We believe our
results - beyond their intrinsic automata-theoretic interest - will also provide tools for addressing important
problems in learning theory, including the robust statistical analysis of spectral learning algorithms.

¹ After the submission of this manuscript we became aware of the concurrent work [?], where a problem similar to the one
considered here is addressed, albeit different methods are used and the results are not directly comparable.

Let us conclude this introduction by mentioning the potential wide applicability of our results in the 
field of algorithms for manipulating, combining, and operating with quantitative systems. In particular, the 
possibility of obtaining reduced-size models incurring a small, controlled amount of error might provide a 
principled way for speeding up a number of such algorithms. 

The content of the paper is organized as follows. Section 2 defines the notation that will be used
throughout the paper and reviews a series of well-known results that will be needed. Section 3 establishes
the existence of a canonical form for WFA and provides a polynomial-time algorithm for computing it (the
first major contribution of this work). The computation of this canonical form lies at the heart of our
approximate minimization algorithm, which is described and analyzed in Section 4. Our main theoretical
result in that section establishes bounds describing how well the approximation obtained by the algorithm
emulates the behavior of the original WFA. The proof is quite lengthy and is deferred to Appendix A. In
Section 5 we discuss two technical aspects of our work: its relation to, and consequences for, the mathematical
theory of low-rank approximation of rational series; and the (ir)relevance of an assumption made in our
results from Sections 3 and 4. We conclude with Section 6, where we point out interesting future research
directions.


2 Background 

2.1 Notation for Matrices 

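Given a positive integer $d$, we denote $[d] = \{1, \ldots, d\}$. We use bold letters to denote vectors $\mathbf{v} \in \mathbb{R}^d$
and matrices $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$. Unless explicitly stated, all vectors are column vectors. We write $\mathbf{I}$ for the
identity matrix, $\mathrm{diag}(a_1, \ldots, a_n)$ for a diagonal matrix with $a_1, \ldots, a_n$ in the diagonal, and $\mathrm{diag}(\mathbf{M}_1, \ldots, \mathbf{M}_n)$
for the block-diagonal matrix containing the square matrices $\mathbf{M}_i$ along the diagonal. The $i$th coordinate
vector $(0, \ldots, 0, 1, 0, \ldots, 0)^\top$ is denoted by $\mathbf{e}_i$. For a matrix $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$, $i \in [d_1]$, and $j \in [d_2]$, we
use $\mathbf{M}(i,:)$ and $\mathbf{M}(:,j)$ to denote the $i$th row and the $j$th column of $\mathbf{M}$ respectively. Given a matrix
$\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$ we can consider the vector $\mathrm{vec}(\mathbf{M}) \in \mathbb{R}^{d_1 \cdot d_2}$ obtained by concatenating the rows of $\mathbf{M}$ so that
$\mathrm{vec}(\mathbf{M})((i-1)d_2 + j) = \mathbf{M}(i,j)$. Given two matrices $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$ and $\mathbf{M}' \in \mathbb{R}^{d_1' \times d_2'}$ we denote their Kronecker
(or tensor) product by $\mathbf{M} \otimes \mathbf{M}' \in \mathbb{R}^{d_1 d_1' \times d_2 d_2'}$, with entries given by $(\mathbf{M} \otimes \mathbf{M}')((i-1)d_1' + i', (j-1)d_2' + j') =
\mathbf{M}(i,j)\,\mathbf{M}'(i',j')$, where $i \in [d_1]$, $j \in [d_2]$, $i' \in [d_1']$, and $j' \in [d_2']$. For simplicity, we will sometimes write
$\mathbf{M}^{\otimes 2} = \mathbf{M} \otimes \mathbf{M}$, and similarly for vectors. A rank factorization of a rank $n$ matrix $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$ is an
expression of the form $\mathbf{M} = \mathbf{Q}\mathbf{R}$ where $\mathbf{Q} \in \mathbb{R}^{d_1 \times n}$ and $\mathbf{R} \in \mathbb{R}^{n \times d_2}$ are full-rank matrices.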
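
As a quick sanity check on these index conventions, the following NumPy sketch (our own illustration with arbitrary matrices, not part of the original development) verifies that the index formula stated for vec corresponds to row-major flattening, and that `np.kron` produces the stated entries of the Kronecker product. All indices in the code are 0-based, so $(i-1)d_2 + j$ becomes `i * d2 + j`.

```python
import numpy as np

# Arbitrary illustrative matrices (not taken from the paper).
M = np.arange(6, dtype=float).reshape(2, 3)       # d1 = 2, d2 = 3
Mp = np.arange(1, 5, dtype=float).reshape(2, 2)   # d1' = 2, d2' = 2


def vec(X):
    """Row-major flattening, so that vec(X)[i * d2 + j] == X[i, j]."""
    return X.reshape(-1)


def vec_entry_check(X):
    """Check the vec index formula entry by entry (0-based indices)."""
    d1, d2 = X.shape
    return all(vec(X)[i * d2 + j] == X[i, j]
               for i in range(d1) for j in range(d2))


def kron_entry_check(X, Y):
    """Check (X kron Y)[i*d1p + ip, j*d2p + jp] == X[i, j] * Y[ip, jp]."""
    K = np.kron(X, Y)
    d1p, d2p = Y.shape
    return all(K[i * d1p + ip, j * d2p + jp] == X[i, j] * Y[ip, jp]
               for i in range(X.shape[0]) for j in range(X.shape[1])
               for ip in range(d1p) for jp in range(d2p))
```

With these conventions, `np.kron(M, Mp)` has shape `(d1 * d1', d2 * d2')`, which is the layout assumed by the Kronecker-based formulas later in the paper.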

Given a matrix $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$ of rank $n$, its singular value decomposition (SVD)² is a decomposition of
the form $\mathbf{M} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ where $\mathbf{U} \in \mathbb{R}^{d_1 \times n}$, $\mathbf{D} \in \mathbb{R}^{n \times n}$, and $\mathbf{V} \in \mathbb{R}^{d_2 \times n}$ are such that $\mathbf{U}^\top\mathbf{U} = \mathbf{V}^\top\mathbf{V} = \mathbf{I}$,
and $\mathbf{D} = \mathrm{diag}(s_1, \ldots, s_n)$ with $s_1 \geq \cdots \geq s_n > 0$. The columns of $\mathbf{U}$ and $\mathbf{V}$ are called left and right
singular vectors respectively, and the $s_i$ are its singular values. The SVD is unique (up to sign changes in
associated singular vectors) whenever all inequalities between singular values are strict. A similar spectral
decomposition exists for bounded operators between separable Hilbert spaces. In particular, for finite-rank
bounded operators one can write the infinite matrix corresponding to the operator in a fixed basis, and
recover a concept of reduced SVD for such infinite matrices which shares the same properties
described above for finite matrices [?].

² To be more precise, this is a reduced singular value decomposition, since the inner dimensions of the decomposition are all
equal to the rank. In this paper we shall always use the term SVD to mean reduced SVD.

For $1 \leq p \leq \infty$ we will write $\|\mathbf{v}\|_p$ for the $\ell^p$ norm of vector $\mathbf{v}$. The corresponding induced norm on
matrices is $\|\mathbf{M}\|_p = \sup_{\|\mathbf{v}\|_p = 1} \|\mathbf{M}\mathbf{v}\|_p$. In addition to induced norms, we will also need to define Schatten
norms. If $\mathbf{M}$ is a rank-$n$ matrix with singular values $\mathbf{s} = (s_1, \ldots, s_n)$, the Schatten $p$-norm of $\mathbf{M}$ is given
by $\|\mathbf{M}\|_{S,p} = \|\mathbf{s}\|_p$. Most of these norms have given names: $\|\cdot\|_2 = \|\cdot\|_{S,\infty} = \|\cdot\|_{\mathrm{op}}$ is the operator (or
spectral) norm; $\|\cdot\|_{S,2} = \|\cdot\|_F$ is the Frobenius norm; and $\|\cdot\|_{S,1} = \|\cdot\|_{\mathrm{tr}}$ is the trace (or nuclear) norm.
For a square matrix $\mathbf{M}$ the spectral radius is the largest modulus $\rho(\mathbf{M}) = \max_i |\lambda_i(\mathbf{M})|$ among the eigenvalues of
$\mathbf{M}$. For a square matrix $\mathbf{M}$, the series $\sum_{k \geq 0} \mathbf{M}^k$ converges if and only if $\rho(\mathbf{M}) < 1$, in which case the sum
yields $(\mathbf{I} - \mathbf{M})^{-1}$.
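
The series identity above is used repeatedly in Section 3, so a small numerical illustration may help. The sketch below is our own, with an arbitrary matrix chosen so that its spectral radius is 0.3; it compares a partial sum of the series against the closed form.

```python
import numpy as np


def neumann_partial_sum(M, terms):
    """Return sum_{k=0}^{terms-1} M^k for a square matrix M."""
    total = np.zeros_like(M)
    power = np.eye(M.shape[0])
    for _ in range(terms):
        total += power
        power = power @ M
    return total


# Arbitrary upper-triangular example: eigenvalues are 0.2 and 0.3.
M = np.array([[0.2, 0.1],
              [0.0, 0.3]])
rho = max(abs(np.linalg.eigvals(M)))          # spectral radius
closed_form = np.linalg.inv(np.eye(2) - M)    # (I - M)^{-1}
partial = neumann_partial_sum(M, 60)
```

Because the terms decay like $\rho(\mathbf{M})^k$, a few dozen terms already agree with the closed form to machine precision in this example.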

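Sometimes we will name the columns and rows of a matrix using ordered index sets $\mathcal{I}$ and $\mathcal{J}$. In this case
we will write $\mathbf{M} \in \mathbb{R}^{\mathcal{I} \times \mathcal{J}}$ to denote a matrix of size $|\mathcal{I}| \times |\mathcal{J}|$ with rows indexed by $\mathcal{I}$ and columns indexed
by $\mathcal{J}$.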

2.2 Weighted Automata, Rational Series, and Hankel Matrices 

Let $\Sigma$ be a fixed finite alphabet with $|\Sigma| = k$ symbols, and $\Sigma^*$ the set of all finite strings with symbols in $\Sigma$.
We use $\lambda$ to denote the empty string. Given two strings $p, s \in \Sigma^*$ we write $w = ps$ for their concatenation,
in which case we say that $p$ is a prefix of $w$ and $s$ is a suffix of $w$. We denote by $|w|$ the length (number of
symbols) of a string $w \in \Sigma^*$. Given a set of strings $X \subseteq \Sigma^*$ and a function $f : \Sigma^* \to \mathbb{R}$, we denote by $f(X)$
the summation $\sum_{x \in X} f(x)$ if defined. For example, we will write $f(\Sigma^t) = \sum_{|x| = t} f(x)$ for any $t \geq 0$.

Now we introduce our notation for weighted automata. We want to note that we will not be dealing
with weights on arbitrary semi-rings; this paper only considers automata with real weights, with the usual
addition and multiplication operations. In addition, instead of resorting to the usual description of automata
as directed graphs with labeled nodes and edges, we will use a linear-algebraic representation, which is more
convenient. A weighted finite automaton (WFA) of dimension $n$ over $\Sigma$ is a tuple $A = \langle \boldsymbol{\alpha}_0, \boldsymbol{\alpha}_\infty, \{\mathbf{A}_\sigma\}_{\sigma \in \Sigma} \rangle$
where $\boldsymbol{\alpha}_0 \in \mathbb{R}^n$ is the vector of initial weights, $\boldsymbol{\alpha}_\infty \in \mathbb{R}^n$ is the vector of final weights, and for each
symbol $\sigma \in \Sigma$ the matrix $\mathbf{A}_\sigma \in \mathbb{R}^{n \times n}$ contains the transition weights associated with $\sigma$. Note that in this
representation a fixed initial state is given by $\boldsymbol{\alpha}_0$ (as opposed to formalisms that only specify a transition
structure), and the transition endomorphisms $\mathbf{A}_\sigma$ and the final linear form $\boldsymbol{\alpha}_\infty$ are given in a fixed basis on
$\mathbb{R}^n$ (as opposed to abstract descriptions where these objects are represented as basis-independent elements
over some $n$-dimensional vector space).

We will use $\dim(A)$ to denote the dimension of a WFA. The state-space of a WFA of dimension $n$ is
identified with the integer set $[n]$. Every WFA $A$ realizes a function $f_A : \Sigma^* \to \mathbb{R}$ which, given a string
$x = x_1 \cdots x_t \in \Sigma^*$, produces

$$f_A(x) = \boldsymbol{\alpha}_0^\top \mathbf{A}_{x_1} \cdots \mathbf{A}_{x_t} \boldsymbol{\alpha}_\infty = \boldsymbol{\alpha}_0^\top \mathbf{A}_x \boldsymbol{\alpha}_\infty ,$$

where we defined the shorthand notation $\mathbf{A}_x = \mathbf{A}_{x_1} \cdots \mathbf{A}_{x_t}$ that will be used throughout the paper. A
function $f : \Sigma^* \to \mathbb{R}$ is called rational if there exists a WFA $A$ such that $f = f_A$. The rank of a rational
function $f$ is the dimension of the smallest WFA realizing $f$. We say that a WFA $A$ is minimal if $\dim(A) =
\mathrm{rank}(f_A)$.
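
To make the linear-algebraic representation concrete, here is a minimal sketch of how $f_A(x)$ can be evaluated. The function name `wfa_value`, the dictionary encoding of the transition matrices, and the 2-state automaton over the alphabet {a, b} are all our own illustrative assumptions; they do not come from the paper.

```python
import numpy as np


def wfa_value(alpha0, alpha_inf, A, x):
    """Evaluate f_A(x) = alpha0^T A_{x_1} ... A_{x_t} alpha_inf.

    A maps each symbol to its n x n transition matrix; x is any
    iterable of symbols (the empty string gives alpha0^T alpha_inf).
    """
    v = np.asarray(alpha0, dtype=float)
    for symbol in x:
        v = v @ A[symbol]          # propagate the row vector one step
    return float(v @ np.asarray(alpha_inf, dtype=float))


# Hypothetical 2-state WFA, used only for illustration.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.5, 0.25])
A = {
    "a": np.array([[0.2, 0.1], [0.0, 0.3]]),
    "b": np.array([[0.1, 0.2], [0.1, 0.0]]),
}

value_empty = wfa_value(alpha0, alpha_inf, A, "")
value_ab = wfa_value(alpha0, alpha_inf, A, "ab")
```

Evaluating a string of length $t$ this way costs $O(t n^2)$ operations, since only vector-matrix products are needed.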

An important operation on WFA is conjugation by an invertible matrix. Suppose $A$ is a WFA of dimension
$n$ and $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is invertible. Then we can define the WFA

$$A' = \mathbf{Q}^{-1} A \mathbf{Q} = \langle \mathbf{Q}^\top \boldsymbol{\alpha}_0, \mathbf{Q}^{-1} \boldsymbol{\alpha}_\infty, \{\mathbf{Q}^{-1} \mathbf{A}_\sigma \mathbf{Q}\} \rangle . \qquad (1)$$

It is immediate to check that $f_A = f_{A'}$. This means that the function computed by a WFA is invariant under
conjugation, and that given a rational function $f$, there exist infinitely many WFA realizing $f$. In addition,
the following result characterizes all minimal WFA realizing a particular rational function.

Theorem 1 ([12]). If $A$ and $B$ are minimal WFA realizing the same function, then $B = \mathbf{Q}^{-1} A \mathbf{Q}$ for some
invertible $\mathbf{Q}$.
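
Invariance under the conjugation of Eq. (1) can be checked numerically. The sketch below is our own illustration: the helper names, the example automaton and the matrix `Q` are assumptions made for the demonstration, not objects defined in the paper.

```python
import numpy as np
from itertools import product


def wfa_value(alpha0, alpha_inf, A, x):
    """Evaluate f_A(x) = alpha0^T A_x alpha_inf."""
    v = np.asarray(alpha0, dtype=float)
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ alpha_inf)


def conjugate(alpha0, alpha_inf, A, Q):
    """Return Q^{-1} A Q = (Q^T alpha0, Q^{-1} alpha_inf, {Q^{-1} A_s Q})."""
    Q_inv = np.linalg.inv(Q)
    return (Q.T @ alpha0,
            Q_inv @ alpha_inf,
            {s: Q_inv @ As @ Q for s, As in A.items()})


# Hypothetical 2-state WFA and an arbitrary invertible Q (det = 1).
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.5, 0.25])
A = {"a": np.array([[0.2, 0.1], [0.0, 0.3]]),
     "b": np.array([[0.1, 0.2], [0.1, 0.0]])}
Q = np.array([[2.0, 1.0], [1.0, 1.0]])

conj = conjugate(alpha0, alpha_inf, A, Q)
strings = ["".join(w) for t in range(4) for w in product("ab", repeat=t)]
max_gap = max(abs(wfa_value(alpha0, alpha_inf, A, x) - wfa_value(*conj, x))
              for x in strings)
```

The two automata have different weights but agree on every string tested, up to floating-point error.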

A function $f : \Sigma^* \to \mathbb{R}$ can be trivially identified with an element from the free vector space $\mathbb{R}^{\Sigma^*}$. This
vector space contains several subspaces which will play an important role in the rest of the paper. One is the
subspace of all rational functions, which we denote by $\mathcal{R}(\Sigma)$. Note that $\mathcal{R}(\Sigma)$ is a linear subspace, because
if $f, g \in \mathcal{R}(\Sigma)$ and $c \in \mathbb{R}$, then $cf$ and $f + g$ are both rational [12]. Another important family of subspaces
of $\mathbb{R}^{\Sigma^*}$ are the ones containing all functions with finite $p$-norm for some $1 \leq p \leq \infty$, which is given by
$\|f\|_p^p = \sum_{x \in \Sigma^*} |f(x)|^p$ for finite $p$, and $\|f\|_\infty = \sup_{x \in \Sigma^*} |f(x)|$; we denote this space by $\ell^p(\Sigma)$. Note that,
like in the usual theory of Banach spaces of sequences, we have $\ell^p(\Sigma) \subset \ell^q(\Sigma)$ for $p < q$. Of these, $\ell^2(\Sigma)$ can
be endowed with the structure of a separable Hilbert space with the inner product $\langle f, g \rangle = \sum_{x \in \Sigma^*} f(x) g(x)$.
Recall that in this case we have the Cauchy-Schwarz inequality $\langle f, g \rangle^2 \leq \|f\|_2^2 \, \|g\|_2^2$. In addition, we have
its generalization, Hölder's inequality: given $f \in \ell^p(\Sigma)$ and $g \in \ell^q(\Sigma)$ with $p^{-1} + q^{-1} \geq 1$, then $\|f \cdot g\|_1 \leq
\|f\|_p \|g\|_q$, where $(f \cdot g)(x) = f(x) g(x)$. By intersecting any of the previous subspaces with $\mathcal{R}(\Sigma)$ one obtains
$\ell^p_{\mathcal{R}}(\Sigma) = \mathcal{R}(\Sigma) \cap \ell^p(\Sigma)$, the normed vector space containing all rational functions with finite $p$-norm. In most
cases the alphabet $\Sigma$ will be clear from the context and we will just write $\mathcal{R}$, $\ell^p$, and $\ell^p_{\mathcal{R}}$.

The space $\ell^1_{\mathcal{R}}$ of absolutely convergent rational series will play a central role in the theory to be developed
in this paper. An important example of functions in $\ell^1_{\mathcal{R}}$ is that of probability distributions over $\Sigma^*$ realized
by WFA, also known as rational stochastic languages. Formally speaking, these are rational functions $f \in \mathcal{R}$
satisfying the constraints $f(x) \geq 0$ and $\sum_x f(x) = 1$. This implies that $\ell^1_{\mathcal{R}}$ includes all functions realized by
probabilistic automata with stopping probabilities [25], hidden Markov models with absorbing states [33],
and predictive state representations for dynamical systems with discounting or finite horizon [36]. Note that
given a WFA $A$, the membership problem $f_A \in \ell^1_{\mathcal{R}}$ is known to be semi-decidable [?].
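
As an illustration of a rational stochastic language, consider a probabilistic automaton with stopping probabilities. Writing $\mathbf{A} = \sum_\sigma \mathbf{A}_\sigma$, the total mass is $\sum_x f(x) = \sum_{t \geq 0} \boldsymbol{\alpha}_0^\top \mathbf{A}^t \boldsymbol{\alpha}_\infty = \boldsymbol{\alpha}_0^\top (\mathbf{I} - \mathbf{A})^{-1} \boldsymbol{\alpha}_\infty$ whenever $\rho(\mathbf{A}) < 1$, by the series identity from Section 2.1. The sketch below checks this on a hypothetical 2-state automaton of our own choosing, in which every state stops with probability 0.2.

```python
import numpy as np

# Hypothetical probabilistic automaton with stopping probabilities.
# For each state, outgoing transition weights plus the stopping weight sum to 1.
alpha0 = np.array([1.0, 0.0])            # start in the first state
alpha_inf = np.array([0.2, 0.2])         # stopping probabilities
A = {"a": np.array([[0.3, 0.2], [0.0, 0.4]]),
     "b": np.array([[0.1, 0.2], [0.3, 0.1]])}

A_sum = sum(A.values())                  # A = sum over symbols of A_sigma
rho = max(abs(np.linalg.eigvals(A_sum)))

# Total mass: alpha0^T (I - A)^{-1} alpha_inf, computed with a linear solve.
total_mass = float(alpha0 @ np.linalg.solve(np.eye(2) - A_sum, alpha_inf))
```

In this example the row sums of `A_sum` are 0.8, so the spectral radius is below one and the total mass comes out as 1, as expected for a probability distribution over strings.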

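Let $\mathbf{H} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ be a bi-infinite matrix whose rows and columns are indexed by strings. We say that
$\mathbf{H}$ is Hankel³ if for all strings $p, p', s, s' \in \Sigma^*$ such that $ps = p's'$ we have $\mathbf{H}(p, s) = \mathbf{H}(p', s')$. Given a
function $f : \Sigma^* \to \mathbb{R}$ we can associate with it a Hankel matrix $\mathbf{H}_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ with entries $\mathbf{H}_f(p, s) = f(ps)$.
Conversely, given a matrix $\mathbf{H} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ with the Hankel property, there exists a unique function $f : \Sigma^* \to \mathbb{R}$
such that $\mathbf{H}_f = \mathbf{H}$. The following well-known theorem characterizes all Hankel matrices of finite rank.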

Theorem 2 ([12]). For any function $f : \Sigma^* \to \mathbb{R}$, the Hankel matrix $\mathbf{H}_f$ has finite rank $n$ if and only if $f$
is rational with $\mathrm{rank}(f) = n$. In other words, $\mathrm{rank}(f) = \mathrm{rank}(\mathbf{H}_f)$ for any function $f : \Sigma^* \to \mathbb{R}$.
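
Theorem 2 can be observed on finite sub-blocks of the Hankel matrix: any such block has rank at most $\mathrm{rank}(f)$, and a block built from a sufficiently rich set of prefixes and suffixes attains it. The sketch below is our own illustration, using a hypothetical 2-state WFA and all strings of length at most 2 as both prefixes and suffixes.

```python
import numpy as np
from itertools import product


def wfa_value(alpha0, alpha_inf, A, x):
    """Evaluate f_A(x) = alpha0^T A_x alpha_inf."""
    v = np.asarray(alpha0, dtype=float)
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ alpha_inf)


def hankel_block(f, prefixes, suffixes):
    """Finite sub-block of H_f with entries H(p, s) = f(ps)."""
    return np.array([[f(p + s) for s in suffixes] for p in prefixes])


# Hypothetical 2-state WFA over {a, b}.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.5, 0.25])
A = {"a": np.array([[0.2, 0.1], [0.0, 0.3]]),
     "b": np.array([[0.1, 0.2], [0.1, 0.0]])}

basis = ["".join(w) for t in range(3) for w in product("ab", repeat=t)]
H = hankel_block(lambda x: wfa_value(alpha0, alpha_inf, A, x), basis, basis)

# An explicit tolerance separates true singular values from round-off.
rank = np.linalg.matrix_rank(H, tol=1e-10)
```

Here the 7 x 7 block has numerical rank 2, matching the number of states, which also shows that this example automaton is minimal.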


3 A Canonical Form for WFA 


In this section we discuss the existence and computation of a canonical form for WFA realizing absolutely
convergent rational functions. Our canonical form is strongly related to the singular value decomposition of
infinite Hankel matrices. In particular, its existence and uniqueness is a direct consequence of the existence
and uniqueness of SVD for Hankel matrices of functions in $\ell^1_{\mathcal{R}}$, as we shall see in the first part of this section.
Furthermore, the algorithm given in Section 3.2 for computing the canonical form can also be interpreted as
a procedure for computing the SVD of an infinite Hankel matrix.


3.1 Existence of the Canonical Form 

A matrix $\mathbf{T} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ can be interpreted as the expression of a (possibly unbounded) linear operator $T :
\ell^2 \to \ell^2$ in terms of the canonical basis. In the case of a Hankel matrix $\mathbf{H}_f$, the associated operator
$H_f$ is called a Hankel operator, and corresponds to the convolution-like operation $(H_f g)(x) = \sum_y f(xy) g(y)$
(assuming the series converges).

Recall the operator norm of $T : \ell^2 \to \ell^2$ is defined as $\|T\|_{\mathrm{op}} = \sup_{\|f\|_2 \leq 1} \|Tf\|_2$. An operator is bounded
if $\|T\|_{\mathrm{op}}$ is finite. Although not all Hankel operators are bounded, the next lemma gives a sufficient condition
for $H_f$ to be bounded.

Lemma 3. If $f \in \ell^1_{\mathcal{R}}$, then $H_f$ is bounded.

Proof. Let $h(x) = 1 + |x|$ and note that $f \in \ell^1_{\mathcal{R}}$ implies $\sup_x |f(x)|(1 + |x|) < \infty$; i.e. $f \cdot h \in \ell^\infty$. Now let
$g \in \ell^2$ with $\|g\|_2 = 1$ and for any $x \in \Sigma^*$ define the function $f_x(y) = f(xy)$. Then we have

\begin{align*}
\|H_f g\|_2^2 &= \sum_x \Big( \sum_y f(xy) g(y) \Big)^2 = \sum_x \langle f_x, g \rangle^2 \\
&\leq \sum_x \|f_x\|_2^2 = \sum_x \sum_y f(xy)^2 \\
&= \sum_z (1 + |z|) f(z)^2 = \sum_z |f(z)| \, |(1 + |z|) f(z)| \\
&\leq \|f\|_1 \, \|f \cdot h\|_\infty < \infty ,
\end{align*}

where we used the Cauchy-Schwarz inequality, the fact that the number of different ways to split a string $z$ into a prefix
and a suffix equals $1 + |z|$, and Hölder's inequality. This concludes the proof. □

³ In real analysis a matrix $\mathbf{M}$ is Hankel if $\mathbf{M}(i, j) = \mathbf{M}(k, l)$ whenever $i + j = k + l$, which implies that $\mathbf{M}$ is symmetric. In our
case we have $\mathbf{H}(p, s) = \mathbf{H}(p', s')$ whenever $ps = p's'$, but $\mathbf{H}$ is not symmetric because string concatenation is not commutative
whenever $|\Sigma| > 1$.

Theorem 2 and Lemma 3 imply that, for any $f \in \ell^1_{\mathcal{R}}$, the Hankel matrix $\mathbf{H}_f$ represents a bounded
finite-rank linear operator $H_f$ on the Hilbert space $\ell^2$. Hence, $\mathbf{H}_f$ admits a reduced singular value decomposition
$\mathbf{H}_f = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ where $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{\Sigma^* \times n}$ and $\mathbf{D} \in \mathbb{R}^{n \times n}$ with $n = \mathrm{rank}(f)$. The Hankel singular values of
a rational function $f \in \ell^1_{\mathcal{R}}$ are defined as the singular values of the Hankel matrix $\mathbf{H}_f$. These singular
values can be used to define a new set of norms on $\ell^1_{\mathcal{R}}$: the Schatten-Hankel $p$-norm of $f \in \ell^1_{\mathcal{R}}$ is given by
$\|f\|_{H,p} = \|\mathbf{H}_f\|_{S,p} = \|(s_1, \ldots, s_n)\|_p$. It is straightforward to verify that $\|\cdot\|_{H,p}$ satisfies the properties of a
norm.

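Note that an SVD of $\mathbf{H}_f$ yields a rank factorization given by $\mathbf{H}_f = (\mathbf{U}\mathbf{D}^{1/2})(\mathbf{V}\mathbf{D}^{1/2})^\top$. But SVD is not the
only way to obtain rank factorizations for Hankel matrices. In fact, if $f$ is rational, then every minimal WFA
$A$ realizing $f$ induces a rank factorization of $\mathbf{H}_f$ as follows. Let $\mathbf{P}_A \in \mathbb{R}^{\Sigma^* \times n}$ be the forward matrix of $A$ given
by $\mathbf{P}_A(p, :) = \boldsymbol{\alpha}_0^\top \mathbf{A}_p$ for any string $p \in \Sigma^*$. Similarly, let $\mathbf{S}_A \in \mathbb{R}^{\Sigma^* \times n}$ be the backward matrix of $A$ given by
$\mathbf{S}_A(s, :) = (\mathbf{A}_s \boldsymbol{\alpha}_\infty)^\top$ for any string $s \in \Sigma^*$. Since $\mathbf{H}_f(p, s) = f(ps) = \boldsymbol{\alpha}_0^\top \mathbf{A}_p \mathbf{A}_s \boldsymbol{\alpha}_\infty = \mathbf{P}_A(p, :)\, \mathbf{S}_A^\top(:, s)$, we
obtain $\mathbf{H}_f = \mathbf{P}_A \mathbf{S}_A^\top$. This is known as the forward-backward (FB) rank factorization of $\mathbf{H}_f$ induced by $A$.
The following result shows that among the infinity of minimal WFA realizing a given rational function
$f \in \ell^1_{\mathcal{R}}$, there exists one whose induced FB rank factorization coincides with $\mathbf{H}_f = (\mathbf{U}\mathbf{D}^{1/2})(\mathbf{V}\mathbf{D}^{1/2})^\top$.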

Theorem 4. Let $f \in \ell^1_{\mathcal{R}}$ and suppose $\mathbf{H}_f = (\mathbf{U}\mathbf{D}^{1/2})(\mathbf{V}\mathbf{D}^{1/2})^\top$ is a rank factorization induced by SVD.
Then there exists a minimal WFA $A$ for $f$ inducing the same rank factorization. That is, $A$ induces a FB
rank factorization of $\mathbf{H}_f$ given by $\mathbf{P}_A = \mathbf{U}\mathbf{D}^{1/2}$ and $\mathbf{S}_A = \mathbf{V}\mathbf{D}^{1/2}$.

Since we have already established the existence of an SVD for $\mathbf{H}_f$ whenever $f \in \ell^1_{\mathcal{R}}$, the theorem is just
a direct application of the following lemma.

Lemma 5. Suppose $f \in \ell^1_{\mathcal{R}}$ and $\mathbf{H}_f = \mathbf{P}\mathbf{S}^\top$ is a rank factorization. Then there exists a minimal WFA $A$
realizing $f$ which induces this factorization.

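Proof. Let $B$ be any minimal WFA realizing $f$ and denote $n = \mathrm{rank}(f)$. Then we have two rank factorizations
$\mathbf{P}\mathbf{S}^\top = \mathbf{P}_B \mathbf{S}_B^\top$ for the Hankel matrix $\mathbf{H}_f$. Therefore, the columns of $\mathbf{P}$ and $\mathbf{P}_B$ both span the same
$n$-dimensional subspace of $\mathbb{R}^{\Sigma^*}$, and there exists a change of basis $\mathbf{Q} \in \mathbb{R}^{n \times n}$ such that $\mathbf{P}_B \mathbf{Q} = \mathbf{P}$. This
implies we must also have $\mathbf{S}^\top = \mathbf{Q}^{-1} \mathbf{S}_B^\top$. It follows that $A = \mathbf{Q}^{-1} B \mathbf{Q}$ is a minimal WFA for $f$ inducing the
desired rank factorization. □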

The results above lead us to our first contribution: the definition of a canonical form for WFA realizing
functions in $\ell^1_{\mathcal{R}}$.

Definition 6. Let $f \in \ell^1_{\mathcal{R}}$. A singular value automaton (SVA) for $f$ is a minimal WFA $A$ realizing $f$ such
that the FB rank factorization of $\mathbf{H}_f$ induced by $A$ has the form given in Theorem 4.

Note that the SVA provided by Theorem 4 is unique up to the same conditions in which SVD is unique. In
particular, it is easy to verify that if the Hankel singular values of $f \in \ell^1_{\mathcal{R}}$ satisfy the strict inequalities
$s_1 > \cdots > s_n$, then the transition weights of the SVA $A$ of $f$ are uniquely defined, and the initial and final
weights are uniquely defined up to sign changes.

The next subsection gives a polynomial-time algorithm for computing the SVA of a function $f \in \ell^1_{\mathcal{R}}$
starting from a WFA realizing $f$.

3.2 Computing the Canonical Form 

As we have seen above, a bi-infinite Hankel matrix $\mathbf{H}_f$ of rank $n$ can actually be represented with the
$n(2 + kn)$ parameters needed to specify the initial, final and transition weights of a minimal WFA $A$ realizing
$f$. Though in principle $A$ contains enough information to reconstruct $\mathbf{H}_f$, a priori it is not clear that $A$
provides an efficient representation for operating on $\mathbf{H}_f$. Luckily, it turns out WFA possess a rich algebraic
structure allowing many operations on rational functions and their corresponding Hankel matrices to be
performed in "compressed" form by operating directly on WFA representing them [12]. In this section we
show it is also possible to compute the SVD of $\mathbf{H}_f$ by operating on a minimal WFA realizing $f$; that is, we
give an algorithm for computing SVA representations.

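We start with a simple linear algebra fact showing how to leverage a rank factorization of a given matrix
in order to compute its reduced SVD. Let $\mathbf{M} \in \mathbb{R}^{p \times s}$ be a matrix of rank $n$ and suppose $\mathbf{M} = \mathbf{P}\mathbf{S}^\top$ is a
rank factorization. Let $\mathbf{G}_p = \mathbf{P}^\top\mathbf{P} \in \mathbb{R}^{n \times n}$ be the Gram matrix of the columns of $\mathbf{P}$. Since $\mathbf{G}_p$ is positive
definite, it admits a spectral decomposition $\mathbf{G}_p = \mathbf{V}_p \mathbf{D}_p \mathbf{V}_p^\top$. Similarly, we have $\mathbf{G}_s = \mathbf{S}^\top\mathbf{S} = \mathbf{V}_s \mathbf{D}_s \mathbf{V}_s^\top$.
With this notation we have the following.

Lemma 7. Let $\tilde{\mathbf{M}} = \mathbf{D}_p^{1/2} \mathbf{V}_p^\top \mathbf{V}_s \mathbf{D}_s^{1/2}$ with reduced SVD $\tilde{\mathbf{M}} = \tilde{\mathbf{U}} \tilde{\mathbf{D}} \tilde{\mathbf{V}}^\top$. If $\mathbf{Q}_p = \mathbf{V}_p \mathbf{D}_p^{-1/2} \tilde{\mathbf{U}}$, $\mathbf{U} = \mathbf{P}\mathbf{Q}_p$,
$\mathbf{Q}_s = \mathbf{V}_s \mathbf{D}_s^{-1/2} \tilde{\mathbf{V}}$, $\mathbf{V} = \mathbf{S}\mathbf{Q}_s$, and $\mathbf{D} = \tilde{\mathbf{D}}$, then $\mathbf{M} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ is a reduced SVD for $\mathbf{M}$.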

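Proof. We just need to check that the columns of $\mathbf{U}$ and $\mathbf{V}$ are orthonormal, and that $\mathbf{M} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$:

\begin{align*}
\mathbf{U}^\top\mathbf{U} &= \mathbf{Q}_p^\top \mathbf{P}^\top \mathbf{P} \mathbf{Q}_p \\
&= \tilde{\mathbf{U}}^\top \mathbf{D}_p^{-1/2} \mathbf{V}_p^\top \mathbf{G}_p \mathbf{V}_p \mathbf{D}_p^{-1/2} \tilde{\mathbf{U}} \\
&= \tilde{\mathbf{U}}^\top \mathbf{D}_p^{-1/2} \mathbf{V}_p^\top \mathbf{V}_p \mathbf{D}_p \mathbf{V}_p^\top \mathbf{V}_p \mathbf{D}_p^{-1/2} \tilde{\mathbf{U}} \\
&= \tilde{\mathbf{U}}^\top \tilde{\mathbf{U}} = \mathbf{I} , \\[4pt]
\mathbf{V}^\top\mathbf{V} &= \mathbf{Q}_s^\top \mathbf{S}^\top \mathbf{S} \mathbf{Q}_s \\
&= \tilde{\mathbf{V}}^\top \mathbf{D}_s^{-1/2} \mathbf{V}_s^\top \mathbf{G}_s \mathbf{V}_s \mathbf{D}_s^{-1/2} \tilde{\mathbf{V}} \\
&= \tilde{\mathbf{V}}^\top \mathbf{D}_s^{-1/2} \mathbf{V}_s^\top \mathbf{V}_s \mathbf{D}_s \mathbf{V}_s^\top \mathbf{V}_s \mathbf{D}_s^{-1/2} \tilde{\mathbf{V}} \\
&= \tilde{\mathbf{V}}^\top \tilde{\mathbf{V}} = \mathbf{I} , \\[4pt]
\mathbf{U}\mathbf{D}\mathbf{V}^\top &= \mathbf{P}\mathbf{Q}_p \mathbf{D} \mathbf{Q}_s^\top \mathbf{S}^\top \\
&= \mathbf{P}\mathbf{V}_p \mathbf{D}_p^{-1/2} \tilde{\mathbf{U}} \tilde{\mathbf{D}} \tilde{\mathbf{V}}^\top \mathbf{D}_s^{-1/2} \mathbf{V}_s^\top \mathbf{S}^\top \\
&= \mathbf{P}\mathbf{V}_p \mathbf{D}_p^{-1/2} \tilde{\mathbf{M}} \mathbf{D}_s^{-1/2} \mathbf{V}_s^\top \mathbf{S}^\top \\
&= \mathbf{P}\mathbf{V}_p \mathbf{D}_p^{-1/2} \mathbf{D}_p^{1/2} \mathbf{V}_p^\top \mathbf{V}_s \mathbf{D}_s^{1/2} \mathbf{D}_s^{-1/2} \mathbf{V}_s^\top \mathbf{S}^\top \\
&= \mathbf{P}\mathbf{S}^\top = \mathbf{M} .
\end{align*}

□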

Note that the above result does not require $p$ and $s$ to be finite. In particular, when $\mathbf{M}$ is an infinite matrix
associated with a finite-rank bounded operator, the computation of $\mathbf{Q}_p$ and $\mathbf{Q}_s$ can still be done efficiently
as long as $\mathbf{G}_p$ and $\mathbf{G}_s$ are available.
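
The construction in Lemma 7 translates directly into a few lines of NumPy. The following is a minimal sketch under our own assumptions: finite random factors `P` and `S` stand in for a rank factorization, the function name is hypothetical, and `numpy.linalg.eigh` is used for the spectral decompositions of the two Gram matrices.

```python
import numpy as np


def svd_from_rank_factorization(P, S):
    """Reduced SVD of M = P S^T computed from the n x n Gram matrices only."""
    dp, Vp = np.linalg.eigh(P.T @ P)      # G_p = V_p D_p V_p^T
    ds, Vs = np.linalg.eigh(S.T @ S)      # G_s = V_s D_s V_s^T
    # Small n x n matrix M~ = D_p^{1/2} V_p^T V_s D_s^{1/2} and its SVD.
    M_tilde = np.diag(np.sqrt(dp)) @ Vp.T @ Vs @ np.diag(np.sqrt(ds))
    U_t, d, Vt_t = np.linalg.svd(M_tilde)
    Qp = Vp @ np.diag(dp ** -0.5) @ U_t   # Q_p = V_p D_p^{-1/2} U~
    Qs = Vs @ np.diag(ds ** -0.5) @ Vt_t.T  # Q_s = V_s D_s^{-1/2} V~
    return P @ Qp, d, S @ Qs


# Arbitrary full-column-rank factors, for illustration only.
rng = np.random.default_rng(0)
P = rng.standard_normal((6, 2))
S = rng.standard_normal((5, 2))
U, d, V = svd_from_rank_factorization(P, S)
```

Only the products `P @ Qp` and `S @ Qs` touch the (possibly infinite) row dimension; everything else happens on $n \times n$ matrices, which is what makes the approach usable for Hankel matrices once the Gram matrices are known.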
Our goal now is to leverage this result in order to compute the SVD of the bi-infinite Hankel matrix $\mathbf{H}_f$
associated with a rational function $f \in \ell^1_{\mathcal{R}}$. The key step will be to compute the Gram matrices associated
with the rank factorization induced by a minimal WFA for $f$. We start with a lemma showing how to
compute the inner product between two rational functions.

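Lemma 8. Let $A = \langle \boldsymbol{\alpha}_0, \boldsymbol{\alpha}_\infty, \{\mathbf{A}_\sigma\} \rangle$ and $B = \langle \boldsymbol{\beta}_0, \boldsymbol{\beta}_\infty, \{\mathbf{B}_\sigma\} \rangle$ be minimal WFA realizing functions $f_A, f_B \in
\ell^2_{\mathcal{R}}$. Suppose the spectral radius of the matrix $\mathbf{C} = \sum_{\sigma \in \Sigma} \mathbf{A}_\sigma \otimes \mathbf{B}_\sigma$ satisfies $\rho(\mathbf{C}) < 1$. Then the inner product
between $f_A$ and $f_B$ can be computed as:

$$\langle f_A, f_B \rangle = (\boldsymbol{\alpha}_0 \otimes \boldsymbol{\beta}_0)^\top (\mathbf{I} - \mathbf{C})^{-1} (\boldsymbol{\alpha}_\infty \otimes \boldsymbol{\beta}_\infty) .$$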

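Proof. First note that $g(x) = f_A(x) f_B(x)$ is in $\ell^1$ by Hölder's inequality. Therefore $\langle f_A, f_B \rangle = \sum_x g(x) \leq
\|g\|_1 < \infty$. In addition, $g$ is rational [32] and can be computed by the WFA $C = \langle \boldsymbol{\gamma}_0, \boldsymbol{\gamma}_\infty, \{\mathbf{C}_\sigma\} \rangle$ given
by

\begin{align*}
\boldsymbol{\gamma}_0 &= \boldsymbol{\alpha}_0 \otimes \boldsymbol{\beta}_0 , \\
\boldsymbol{\gamma}_\infty &= \boldsymbol{\alpha}_\infty \otimes \boldsymbol{\beta}_\infty , \\
\mathbf{C}_\sigma &= \mathbf{A}_\sigma \otimes \mathbf{B}_\sigma .
\end{align*}

Now note that one can use a simple induction argument to show that for any finite $t \geq 0$ we have

$$s_t = \sum_{x \in \Sigma^t} \boldsymbol{\gamma}_0^\top \mathbf{C}_x \boldsymbol{\gamma}_\infty = \boldsymbol{\gamma}_0^\top \mathbf{C}^t \boldsymbol{\gamma}_\infty .$$

Because $g \in \ell^1$, the series $\sum_{t \geq 0} s_t$ is absolutely convergent. Thus we must have $\lim_{k \to \infty} \sum_{t \leq k} s_t = L$ for
some finite $L \in \mathbb{R}$. Since $\rho(\mathbf{C}) < 1$ implies the identity $\sum_{t \geq 0} \mathbf{C}^t = (\mathbf{I} - \mathbf{C})^{-1}$, we must necessarily have
$L = \boldsymbol{\gamma}_0^\top (\mathbf{I} - \mathbf{C})^{-1} \boldsymbol{\gamma}_\infty$. □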

Note that the assumption $\rho(\mathbf{C}) < 1$ is an essential part of our calculations. We shall make similar assumptions
in the remainder of this section. See Section 5.2 for a discussion on this assumption and how to remove it.
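
The closed form in Lemma 8 can be compared against a brute-force sum over short strings. The sketch below is our own illustration: the two 2-state automata are hypothetical, the helper names are assumptions, and the code checks $\rho(\mathbf{C}) < 1$ but does not verify minimality of the inputs.

```python
import numpy as np
from itertools import product


def wfa_value(alpha0, alpha_inf, A, x):
    """Evaluate f_A(x) = alpha0^T A_x alpha_inf."""
    v = np.asarray(alpha0, dtype=float)
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ alpha_inf)


def inner_product(wfa_a, wfa_b):
    """<f_A, f_B> = (a0 kron b0)^T (I - C)^{-1} (a_inf kron b_inf)."""
    a0, a_inf, A = wfa_a
    b0, b_inf, B = wfa_b
    C = sum(np.kron(A[s], B[s]) for s in A)      # C = sum_s A_s kron B_s
    if max(abs(np.linalg.eigvals(C))) >= 1:
        raise ValueError("closed form requires rho(C) < 1")
    g0, g_inf = np.kron(a0, b0), np.kron(a_inf, b_inf)
    return float(g0 @ np.linalg.solve(np.eye(C.shape[0]) - C, g_inf))


# Two hypothetical WFA over {a, b}.
wfa_a = (np.array([1.0, 0.0]), np.array([0.5, 0.25]),
         {"a": np.array([[0.2, 0.1], [0.0, 0.3]]),
          "b": np.array([[0.1, 0.2], [0.1, 0.0]])})
wfa_b = (np.array([0.5, 0.5]), np.array([1.0, 0.0]),
         {"a": np.array([[0.1, 0.3], [0.2, 0.0]]),
          "b": np.array([[0.0, 0.2], [0.3, 0.1]])})

closed = inner_product(wfa_a, wfa_b)
# Truncated sum of f_A(x) f_B(x) over all strings of length at most 12.
brute = sum(wfa_value(*wfa_a, x) * wfa_value(*wfa_b, x)
            for t in range(13) for x in product("ab", repeat=t))
```

The brute-force sum enumerates exponentially many strings, whereas the closed form only requires solving one linear system of size $n_A n_B$.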

The following result shows how to efficiently compute the Gram matrices associated with the rank
factorization induced by a minimal WFA for a function $f \in \ell^1_{\mathcal{R}}$.

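Lemma 9. Let $f \in \ell^1_{\mathcal{R}}$ with $\mathrm{rank}(f) = n$, and $A = \langle \boldsymbol{\alpha}_0, \boldsymbol{\alpha}_\infty, \{\mathbf{A}_\sigma\} \rangle$ be a minimal WFA for $f$ inducing
the FB rank factorization $\mathbf{H}_f = \mathbf{P}\mathbf{S}^\top$. Let $\mathbf{A}^\otimes = \sum_{\sigma} \mathbf{A}_\sigma^{\otimes 2}$. If $\rho(\mathbf{A}^\otimes) < 1$, then the Gram matrices
$\mathbf{G}_p, \mathbf{G}_s \in \mathbb{R}^{n \times n}$ associated with the factorization induced by $A$ satisfy $\mathrm{vec}(\mathbf{G}_p)^\top = (\boldsymbol{\alpha}_0^{\otimes 2})^\top (\mathbf{I} - \mathbf{A}^\otimes)^{-1}$ and
$\mathrm{vec}(\mathbf{G}_s) = (\mathbf{I} - \mathbf{A}^\otimes)^{-1} \boldsymbol{\alpha}_\infty^{\otimes 2}$.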

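Proof. For $i \in [n]$ let $\mathbf{p}_i = \mathbf{P}(:, i) \in \mathbb{R}^{\Sigma^*}$ be the $i$th column of $\mathbf{P}$. The key observation is that the function
$p_i : \Sigma^* \to \mathbb{R}$ defined by $p_i(x) = \mathbf{p}_i(x)$ is in $\ell^2_{\mathcal{R}}$. To show rationality one just needs to check that $p_i$ is the function
realized by the WFA $A_{p,i} = \langle \boldsymbol{\alpha}_0, \mathbf{e}_i, \{\mathbf{A}_\sigma\} \rangle$, which holds by construction of the induced rank factorization. The fact that
$\|p_i\|_2$ is finite follows from Theorem 4 by noting that $\mathbf{p}_i$ is a linear combination of left singular vectors of
$\mathbf{H}_f$, which belong to $\ell^2$ by definition. Thus, $\mathbf{G}_p(i, j) = \mathbf{p}_i^\top \mathbf{p}_j$ is well-defined and corresponds to the inner
product $\langle p_i, p_j \rangle$ which, by Lemma 8, can be computed as

$$(\boldsymbol{\alpha}_0^{\otimes 2})^\top (\mathbf{I} - \mathbf{A}^\otimes)^{-1} (\mathbf{e}_i \otimes \mathbf{e}_j) .$$

Since $\mathbf{e}_i \otimes \mathbf{e}_j = \mathbf{e}_{(i-1)n + j}$, we obtain the desired expression for $\mathrm{vec}(\mathbf{G}_p)$. The expression for $\mathrm{vec}(\mathbf{G}_s)$ follows
from an identical argument using automata $A_{s,i} = \langle \mathbf{e}_i, \boldsymbol{\alpha}_\infty, \{\mathbf{A}_\sigma\} \rangle$. □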
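
The two formulas of Lemma 9 amount to solving one linear system each. The sketch below is our own illustration on a hypothetical 2-state WFA; it computes the Gram matrices in closed form and compares them with the Gram matrices of the forward and backward matrices truncated to strings of length at most 12. The function names are assumptions, and the reshape relies on the row-major vec convention.

```python
import numpy as np


def gram_matrices(alpha0, alpha_inf, A):
    """Closed-form G_p and G_s, assuming rho(sum_s A_s kron A_s) < 1."""
    n = alpha0.size
    A_kron = sum(np.kron(As, As) for As in A.values())
    I = np.eye(n * n)
    # vec(G_p)^T = (alpha0 kron alpha0)^T (I - A_kron)^{-1}
    Gp = np.linalg.solve((I - A_kron).T, np.kron(alpha0, alpha0)).reshape(n, n)
    # vec(G_s) = (I - A_kron)^{-1} (alpha_inf kron alpha_inf)
    Gs = np.linalg.solve(I - A_kron, np.kron(alpha_inf, alpha_inf)).reshape(n, n)
    return Gp, Gs


def truncated_gram_matrices(alpha0, alpha_inf, A, max_len):
    """Gram matrices of P and S restricted to strings of length <= max_len."""
    n = alpha0.size
    Gp, Gs = np.zeros((n, n)), np.zeros((n, n))
    fwd, bwd = [alpha0], [alpha_inf]       # rows of P and S for the empty string
    for _ in range(max_len + 1):
        Gp += sum(np.outer(v, v) for v in fwd)
        Gs += sum(np.outer(v, v) for v in bwd)
        fwd = [v @ As for v in fwd for As in A.values()]   # append a symbol
        bwd = [As @ v for v in bwd for As in A.values()]   # prepend a symbol
    return Gp, Gs


# Hypothetical 2-state WFA over {a, b}.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.5, 0.25])
A = {"a": np.array([[0.2, 0.1], [0.0, 0.3]]),
     "b": np.array([[0.1, 0.2], [0.1, 0.0]])}

Gp, Gs = gram_matrices(alpha0, alpha_inf, A)
Gp_trunc, Gs_trunc = truncated_gram_matrices(alpha0, alpha_inf, A, 12)
```

The truncated computation grows exponentially with the string length, while the closed form needs only an $n^2 \times n^2$ linear solve.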

Combining the results above we now show it is possible to compute an SVA for $f \in \ell^1_{\mathcal{R}}$ starting from a
minimal WFA realizing $f$. The procedure is called ComputeSVA and its description is given in Algorithm 1.

Theorem 10. Let $A = \langle \boldsymbol{\alpha}_0, \boldsymbol{\alpha}_\infty, \{\mathbf{A}_\sigma\} \rangle$ be a minimal WFA for $f$ such that $\mathbf{A}^\otimes = \sum_\sigma \mathbf{A}_\sigma^{\otimes 2}$ satisfies $\rho(\mathbf{A}^\otimes) <
1$. Then the WFA $A'$ computed by ComputeSVA($A$) is an SVA for $f$.


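Algorithm 1: ComputeSVA

Input: A minimal WFA $A$ with $n$ states for $f \in \ell^1_{\mathcal{R}}$
Output: An SVA $A'$ for $f$

1 Compute $\mathbf{G}_p$ and $\mathbf{G}_s$                                  /* cf. Lemma 9 */
2 Compute $\mathbf{Q}_p$, $\mathbf{Q}_s$, and $\mathbf{D}$                        /* cf. Lemma 7 */
3 Let $A' = \mathbf{D}^{1/2} \mathbf{Q}_s^\top A \, \mathbf{Q}_p \mathbf{D}^{1/2}$        /* cf. Eq. (1) */
4 return $A'$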


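Proof. Let $\mathbf{Q} = \mathbf{Q}_p \mathbf{D}^{1/2}$. Our first observation is that $\mathbf{Q}^{-1} = \mathbf{D}^{1/2} \mathbf{Q}_s^\top$, and thus $A$ and $A'$ are minimal WFA
for $f$. Indeed, we already showed in the proof of Lemma 7 that

$$\mathbf{Q}_p \mathbf{D}^{1/2} \mathbf{D}^{1/2} \mathbf{Q}_s^\top = \mathbf{Q}_p \mathbf{D} \mathbf{Q}_s^\top = \mathbf{I} .$$

In addition, it is immediate to check that if $A$ induces the rank factorization $\mathbf{H}_f = \mathbf{P}\mathbf{S}^\top$, then $A'$ induces
the rank factorization $\mathbf{H}_f = (\mathbf{P}\mathbf{Q}_p\mathbf{D}^{1/2})(\mathbf{D}^{1/2}\mathbf{Q}_s^\top\mathbf{S}^\top)$, which by Lemma 7 satisfies $\mathbf{P}\mathbf{Q}_p\mathbf{D}^{1/2} = \mathbf{U}\mathbf{D}^{1/2}$ and
$\mathbf{D}^{1/2}\mathbf{Q}_s^\top\mathbf{S}^\top = \mathbf{D}^{1/2}\mathbf{V}^\top$. □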
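
The steps of ComputeSVA can be sketched in NumPy as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names and the example automaton are hypothetical, the input is assumed to be minimal with $\rho(\mathbf{A}^\otimes) < 1$, and no attempt is made to handle ill-conditioned Gram matrices.

```python
import numpy as np


def wfa_value(alpha0, alpha_inf, A, x):
    """Evaluate f_A(x) = alpha0^T A_x alpha_inf."""
    v = np.asarray(alpha0, dtype=float)
    for symbol in x:
        v = v @ A[symbol]
    return float(v @ alpha_inf)


def gram_matrices(alpha0, alpha_inf, A):
    """Step 1: Gram matrices G_p and G_s via the Kronecker formulas."""
    n = alpha0.size
    A_kron = sum(np.kron(As, As) for As in A.values())
    I = np.eye(n * n)
    Gp = np.linalg.solve((I - A_kron).T, np.kron(alpha0, alpha0)).reshape(n, n)
    Gs = np.linalg.solve(I - A_kron, np.kron(alpha_inf, alpha_inf)).reshape(n, n)
    return Gp, Gs


def compute_sva(alpha0, alpha_inf, A):
    """Return (alpha0', alpha_inf', {A'_s}, d) with d the Hankel singular values."""
    Gp, Gs = gram_matrices(alpha0, alpha_inf, A)
    # Step 2: Q_p, Q_s and D from the spectral decompositions of G_p and G_s.
    dp, Vp = np.linalg.eigh(Gp)
    ds, Vs = np.linalg.eigh(Gs)
    M_tilde = np.diag(np.sqrt(dp)) @ Vp.T @ Vs @ np.diag(np.sqrt(ds))
    U_t, d, Vt_t = np.linalg.svd(M_tilde)
    Qp = Vp @ np.diag(dp ** -0.5) @ U_t
    Qs = Vs @ np.diag(ds ** -0.5) @ Vt_t.T
    # Step 3: conjugate by Q = Q_p D^{1/2}, whose inverse is D^{1/2} Q_s^T.
    Q = Qp @ np.diag(np.sqrt(d))
    Q_inv = np.diag(np.sqrt(d)) @ Qs.T
    return (Q.T @ alpha0, Q_inv @ alpha_inf,
            {s: Q_inv @ As @ Q for s, As in A.items()}, d)


# Hypothetical minimal 2-state WFA over {a, b}.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.5, 0.25])
A = {"a": np.array([[0.2, 0.1], [0.0, 0.3]]),
     "b": np.array([[0.1, 0.2], [0.1, 0.0]])}

a0_sva, ainf_sva, A_sva, d = compute_sva(alpha0, alpha_inf, A)
Gp_sva, Gs_sva = gram_matrices(a0_sva, ainf_sva, A_sva)
```

A convenient sanity check is that both Gram matrices of the output automaton equal $\mathbf{D}$: its forward and backward matrices are $\mathbf{U}\mathbf{D}^{1/2}$ and $\mathbf{V}\mathbf{D}^{1/2}$, whose columns are orthogonal with squared norms given by the singular values.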

We conclude this section by mentioning that it is possible to modify ComputeSVA to take as input a
non-minimal WFA $A$ realizing a function $f \in \ell^1_{\mathcal{R}}$ under the same assumption on the spectral radius of the
matrix $\mathbf{A}^\otimes$ as we have here. We shall present the details of this modification elsewhere. Nonetheless,
we note that, given a non-minimal WFA $A$, one always has the option to minimize $A$ (e.g. using the WFA
minimization algorithm in [?]) before attempting the SVA computation.

3.3 Computational Complexity 

To bound the running time of ComputeSVA(A) we recall the following facts from numerical linear algebra 
(see e.g. [37]): 

• The SVD of $M \in \mathbb{R}^{d_1 \times d_2}$ ($d_1 \geq d_2$) can be computed in time $O(d_1 d_2^2)$.

• The spectral decomposition of a symmetric matrix $M \in \mathbb{R}^{d \times d}$ can be computed in time $O(d^3)$.

• The inverse of an invertible matrix $M \in \mathbb{R}^{d \times d}$ can be computed in time $O(d^3)$.

Now note that, according to Lemma 9, computing the Gram matrices requires $O(kn^4)$ operations to obtain $I - A^{\otimes}$, plus the inversion of this $n^2 \times n^2$ matrix, which can be done in time $O(n^6)$. From Lemma 7 we see that once the $n \times n$ Gram matrices $G_p$ and $G_s$ are given, computing the singular values $D$ and the change of basis matrices $Q_p$ and $Q_s$ can be done in time $O(n^3)$. Finally, conjugating the WFA $A$ into $A'$ takes time $O(kn^3)$, where $k = |\Sigma|$ and $n = \dim(A)$. Hence, the overall running time of ComputeSVA($A$) is $O(n^6 + kn^4)$. Of course, this is a rough estimate which does not take into account improvements that might be possible in practice, especially when the transition matrices of $A$ are sparse; in that case the complexity of most operations could be bounded in terms of the number of non-zeros.
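The steps of ComputeSVA can be illustrated with a minimal NumPy sketch. Several choices below are assumptions made for illustration only: the function names (`evaluate`, `gram_matrices`, `compute_sva`), the dictionary representation of the transition matrices, and the convention that conjugating by $Q$ maps $\langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$ to $\langle Q^\top \alpha_0, Q^{-1} \alpha_\infty, \{Q^{-1} A_\sigma Q\} \rangle$. Since the construction of Lemma 7 is not reproduced in this section, the change of basis is obtained here with a standard Cholesky-plus-SVD balancing step, which may differ from the authors' construction although it yields Gram matrices $G_p = G_s = D$.

```python
import numpy as np

def evaluate(alpha0, alphainf, A, word):
    """f(x) = alpha0^T A_{x_1} ... A_{x_t} alphainf for a word x over the keys of A."""
    v = alpha0
    for sym in word:
        v = v @ A[sym]
    return float(v @ alphainf)

def gram_matrices(alpha0, alphainf, A):
    """Gram matrices G_p = P^T P and G_s = S^T S via the closed form of Lemma 9."""
    n = alpha0.shape[0]
    A_kron = sum(np.kron(M, M) for M in A.values())          # A^(x) = sum_sigma A_sigma (x) A_sigma
    if np.max(np.abs(np.linalg.eigvals(A_kron))) >= 1.0:
        raise ValueError("this sketch requires spectral radius of A^(x) below 1")
    resolvent = np.linalg.inv(np.eye(n * n) - A_kron)         # (I - A^(x))^{-1}, an n^2 x n^2 inverse
    Gp = (np.kron(alpha0, alpha0) @ resolvent).reshape(n, n)  # entry (i, j) pairs with e_i (x) e_j
    Gs = (resolvent @ np.kron(alphainf, alphainf)).reshape(n, n)
    return (Gp + Gp.T) / 2, (Gs + Gs.T) / 2                   # symmetrize against round-off

def compute_sva(alpha0, alphainf, A):
    """Return weights of a conjugated WFA whose Gram matrices both equal diag(d)."""
    Gp, Gs = gram_matrices(alpha0, alphainf, A)
    R = np.linalg.cholesky(Gp).T                # Gp = R^T R (requires a minimal WFA)
    L = np.linalg.cholesky(Gs)                  # Gs = L L^T
    W, d, _ = np.linalg.svd(R @ L)              # d holds the Hankel singular values
    Q = np.linalg.solve(R, W) * np.sqrt(d)      # Q = R^{-1} W D^{1/2}
    Q_inv = (W.T @ R) / np.sqrt(d)[:, None]     # Q^{-1} = D^{-1/2} W^T R
    A_new = {sym: Q_inv @ M @ Q for sym, M in A.items()}
    return Q.T @ alpha0, Q_inv @ alphainf, A_new, d
```

The dominant cost is the inversion of the $n^2 \times n^2$ matrix in `gram_matrices`, in line with the $O(n^6)$ term above; the remaining steps operate on $n \times n$ matrices.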


4 Approximate Minimization of WFA 

In this section we describe and analyze an approximate minimization algorithm for WFA. The algorithm takes as input a minimal WFA $A$ with $n$ states and a target number of states $\hat{n}$, and outputs a new WFA $\hat{A}$ with $\hat{n}$ states approximating the original WFA $A$. To obtain $\hat{A}$ we first compute the SVA $A'$ associated to $A$, and then remove the $n - \hat{n}$ states associated with the smallest singular values of $H_{f_A}$. We call this algorithm SVATruncation (see Algorithm 2 for details). Since the algorithm only involves a call to ComputeSVA and a simple algebraic manipulation of the resulting WFA, the running time of SVATruncation is polynomial in $|\Sigma|$, $\dim(A)$ and $\hat{n}$.

Algorithm 2: SVATruncation

Input: A minimal WFA $A$ with $n$ states, a target number of states $\hat{n} < n$
Output: A WFA $\hat{A}$ with $\hat{n}$ states

1 Let $A' \leftarrow$ ComputeSVA($A$)

2 Let $\Pi = [I_{\hat{n}} \ \ 0] \in \mathbb{R}^{\hat{n} \times n}$

3 Let $\hat{A}_\sigma = \Pi A'_\sigma \Pi^\top$ for all $\sigma \in \Sigma$

4 Let $\hat{\alpha}_0 = \Pi \alpha'_0$

5 Let $\hat{\alpha}_\infty = \Pi \alpha'_\infty$

6 Let $\hat{A} = \langle \hat{\alpha}_0, \hat{\alpha}_\infty, \{\hat{A}_\sigma\} \rangle$

7 return $\hat{A}$
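The truncation step of SVATruncation, taken on its own, amounts to projecting every weight onto the first $\hat{n}$ coordinates. The following NumPy sketch shows only that step, assuming the weights passed in are already those of the SVA $A'$; the function name `sva_truncation` and the dictionary layout of the transition matrices are illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def sva_truncation(alpha0, alphainf, A, n_hat):
    """Keep the first n_hat states of a WFA whose weights are given in SVA form.

    alpha0, alphainf: arrays of shape (n,); A: dict mapping symbols to (n, n) arrays.
    """
    n = alpha0.shape[0]
    if not 0 < n_hat < n:
        raise ValueError("the target size must satisfy 0 < n_hat < n")
    Pi = np.eye(n_hat, n)                                      # Pi = [I 0], shape (n_hat, n)
    A_hat = {sym: Pi @ M @ Pi.T for sym, M in A.items()}       # top-left n_hat x n_hat blocks
    return Pi @ alpha0, Pi @ alphainf, A_hat
```

Multiplying by $\Pi$ is equivalent to slicing the first $\hat{n}$ entries or the leading $\hat{n} \times \hat{n}$ block; the explicit matrix is kept only to mirror the notation of Algorithm 2.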

Roughly speaking, the rationale behind SVATruncation is that, given an SVA, the states corresponding to the smallest singular values are the ones with the least influence on the Hankel matrix, and therefore should also be the ones with the least influence on the associated rational function. However, the details are trickier than this simple intuition suggests. The reason is that a low-rank approximation to $H_f$ obtained by truncating its SVD is not in general a Hankel matrix, and therefore does not correspond to any rational function. In particular, the Hankel matrix of the function $\hat{f}$ computed by $\hat{A}$ is not obtained by truncating the SVD of $H_f$. This makes our analysis non-trivial.

The main result of this section is the following theorem, which bounds the $\ell^2$-distance between the rational function $f$ realized by the original WFA $A$ and the rational function $\hat{f}$ realized by the output WFA $\hat{A}$. The main appeal of our bound is that it only depends on intrinsic quantities associated with the function $f$; that is, the final error bound is independent of which WFA $A$ is given as input. To comply with the assumptions made in the previous section, we shall again assume that the input WFA $A$ satisfies $\rho(A^{\otimes}) < 1$. The same remarks about this assumption discussed in Section 5.2 apply here.


Theorem 11. Let $f \in \ell^2_{\mathcal{R}}$ with rank($f$) $= n$ and fix $0 < \hat{n} < n$. If $A$ is a minimal WFA realizing $f$ and such that $\rho(A^{\otimes}) < 1$, then the WFA $\hat{A} =$ SVATruncation($A$, $\hat{n}$) realizes a function $\hat{f}$ satisfying

$$\|f - \hat{f}\|_2^2 \leq C_f \sqrt{s_{\hat{n}+1} + \cdots + s_n} , \tag{2}$$

where $C_f$ is a positive constant depending only on $f$.

A few remarks about this result are in order. The first is to observe that because $s_1 \geq \cdots \geq s_n$, the error decreases when $\hat{n}$ increases, which is the desired behavior: the more states $\hat{A}$ has, the closer it is to $A$. The second is that (2) does not depend on which representation $A$ of $f$ is given as input to SVATruncation. This is a consequence of first obtaining the corresponding SVA $A'$ before truncating. Obviously, one could obtain another approximate minimization by truncating $A$ directly. However, in that case the final error would depend on the initial $A$, and in general it does not seem possible to use this approach to provide representation-independent bounds on the quality of approximation.

The bulk of the proof of Theorem 11 is deferred to Appendix A. Here we will only discuss the basic principle behind the proof and a key technical lemma which highlights the relevance of the SVA canonical form in our approach.

The first step in the proof is to combine $A'$ and $\hat{A}$ into a single WFA $B$ computing $f_B = (f - \hat{f})^2$, and then decompose the error as

$$\|f - \hat{f}\|_2^2 = \sum_{t \geq 0} \Big( \sum_{x \in \Sigma^t} f_B(x) \Big) .$$

One can then proceed to bound $f_B(\Sigma^t) = \sum_{x \in \Sigma^t} f_B(x)$ for all $t \geq 0$ in terms of the weights of $A$; this involves lengthy algebraic manipulations with many intermediate steps exploiting a variety of properties of matrix norms and Kronecker products. The last and key step is to exploit the internal structure of the SVA canonical form in order to turn these bounds into representation-independent quantities. This part of the analysis is based on the following powerful lemma.

Lemma 12. Let $A = \langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$ be an SVA with $n$ states realizing a function $f \in \ell^2_{\mathcal{R}}$ with Hankel singular values $s_1 \geq \cdots \geq s_n$. Then the following are satisfied:

1. For all $j \in [n]$, $\sum_i s_i \sum_\sigma A_\sigma(i,j)^2 = s_j - \alpha_0(j)^2$,

2. For all $i \in [n]$, $\sum_j s_j \sum_\sigma A_\sigma(i,j)^2 = s_i - \alpha_\infty(i)^2$.

Proof. Recall that $A$ induces the rank factorization $H_f = P S^\top = (U D^{1/2})(D^{1/2} V^\top)$ corresponding to the SVD of $H_f$. Let $p_j$ be the $j$th column of $P = [p_1 \cdots p_n]$ and note we have $\|p_j\|_2^2 = s_j$. By appropriately decomposing the sum in $\|p_j\|_2^2$ we get the following:

$$s_j = p_j(\lambda)^2 + \sum_{\sigma} \sum_{x} p_j(x\sigma)^2 . \tag{3}$$

Let us write $p_j^\sigma$ for the element of $\ell^2$ given by $p_j^\sigma(x) = p_j(x\sigma)$. Note that by construction we have $p_j^\sigma = P A_\sigma(:,j) = \sum_{i \in [n]} p_i A_\sigma(i,j)$. Since $A$ is an SVA, the columns of $P$ are orthogonal and therefore we have

$$\begin{aligned}
\|p_j^\sigma\|_2^2 &= \Big\langle \sum_i p_i A_\sigma(i,j), \sum_{i'} p_{i'} A_\sigma(i',j) \Big\rangle \\
&= \sum_{i,i'} A_\sigma(i,j) A_\sigma(i',j) \langle p_i, p_{i'} \rangle \\
&= \sum_i s_i A_\sigma(i,j)^2 .
\end{aligned}$$

Plugging this into (3) and noting that $p_j(\lambda) = \alpha_0(j)$, we obtain the first claim. The second claim follows from applying the same argument to the columns of $S$. □


To see the importance of this lemma for approximate minimization, let us consider the following simple consequence, which can be derived by combining the bounds for $A_\sigma(i,j)$ obtained from the fact that it belongs to the $i$th row and the $j$th column of $A_\sigma$:

$$|A_\sigma(i,j)| \leq \min\left\{ \sqrt{\frac{s_i}{s_j}}, \sqrt{\frac{s_j}{s_i}} \right\} = \sqrt{\frac{\min\{s_i, s_j\}}{\max\{s_i, s_j\}}} .$$

This bound tells us that in an SVA, transition weights further away from the diagonals of the $A_\sigma$ are going to be small whenever there is a wide spread between the largest and smallest singular values; for example, $|A_\sigma(1,n)| \leq \sqrt{s_n/s_1}$. Intuitively, this means that in an SVA the last states are very weakly connected to the first states, and therefore removing these connections should not affect the output of the WFA too much. Our proof in Appendix A exploits this intuition and turns it into a definite quantitative statement.
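For completeness, the entrywise bound on the transition weights can be derived from Lemma 12 in two lines. The derivation below is a sketch that uses the identities of Lemma 12 in the form $\sum_i s_i \sum_\sigma A_\sigma(i,j)^2 = s_j - \alpha_0(j)^2$ and $\sum_j s_j \sum_\sigma A_\sigma(i,j)^2 = s_i - \alpha_\infty(i)^2$, together with the fact that all Hankel singular values of a rank-$n$ function are strictly positive.

```latex
% Column identity (Lemma 12.1): every term of the sum is non-negative, so a single term
% is at most the whole sum.
s_i \, A_\sigma(i,j)^2
  \;\le\; \sum_{i'} s_{i'} \sum_{\sigma'} A_{\sigma'}(i',j)^2
  \;=\; s_j - \alpha_0(j)^2
  \;\le\; s_j
  \quad\Longrightarrow\quad
  |A_\sigma(i,j)| \le \sqrt{s_j / s_i} .

% Row identity (Lemma 12.2): the same argument applied to the i-th row.
s_j \, A_\sigma(i,j)^2
  \;\le\; \sum_{j'} s_{j'} \sum_{\sigma'} A_{\sigma'}(i,j')^2
  \;=\; s_i - \alpha_\infty(i)^2
  \;\le\; s_i
  \quad\Longrightarrow\quad
  |A_\sigma(i,j)| \le \sqrt{s_i / s_j} .

% Taking the smaller of the two bounds:
|A_\sigma(i,j)| \;\le\; \min\Big\{ \sqrt{s_i/s_j},\, \sqrt{s_j/s_i} \Big\}
  \;=\; \sqrt{\frac{\min\{s_i,s_j\}}{\max\{s_i,s_j\}}} .
```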


5 Technical Discussions 

This section collects detailed discussions of two technical aspects of our work. The first is the relation of our results to the mathematical theory of low-rank approximation of rational series, and their consequences for it. The second makes some remarks about the assumption on the spectral radius of WFA made in our results from Sections 3 and 4.

4 Here we are implicitly using the fact that $\sum_x p_j(x)^2$ is absolutely (and therefore unconditionally) convergent, which implies that any rearrangement of its terms will converge to the same value.

5.1 Low-rank Approximation of Rational Series

We have already discussed how the behavior of the bound (2) matches what intuition suggests. Let us now discuss the quantitative aspects of the bound in more detail. In particular, we want to make a few observations about the connection of (2) with low-rank approximation of matrices. We hope these will shed some light on the mathematical theory of low-rank approximation of rational series, and its relations with low-rank approximations of infinite Hankel matrices, a question which certainly deserves further investigation.

Recall that given a rank-$n$ matrix $M \in \mathbb{R}^{d_1 \times d_2}$ and some $1 \leq \hat{n} < n$, the matrix low-rank approximation problem asks for a matrix $\hat{M}$ attaining the optimum of the following optimization problem:

$$\min_{\mathrm{rank}(M') \leq \hat{n}} \|M - M'\|_F .$$

It is well-known that the solution to this problem can be computed using the SVD of $M$ and satisfies the following error bounds in terms of Schatten $p$-norms:

$$\|M - \hat{M}\|_{S,p} = \|(s_{\hat{n}+1}, \ldots, s_n)\|_p .$$

Using these results it is straightforward to give lower bounds on the approximation errors achievable by low-rank approximation of rational series in terms of Schatten-Hankel norms (cf. Section 3.1). Let $1 \leq p \leq \infty$ and suppose $f \in \ell^2_{\mathcal{R}}$ has rank $n$ and Hankel singular values $s_1 \geq \cdots \geq s_n$. Then the following holds for every $f' \in \ell^2_{\mathcal{R}}$ with rank($f'$) $\leq \hat{n}$:

$$\|f - f'\|_{H,p} \geq \|(s_{\hat{n}+1}, \ldots, s_n)\|_p . \tag{4}$$

On the other hand, we define the optimal $\ell^2$ approximation error of $f$ with respect to all rational functions of rank at most $\hat{n}$ as

$$\varepsilon^{\mathrm{opt}}_{\hat{n}} = \inf_{\mathrm{rank}(f') \leq \hat{n}} \|f - f'\|_2 .$$

It is easy to see the infimum will be attained at some $f^{\mathrm{opt}}_{\hat{n}} \in \ell^2_{\mathcal{R}}$. If $f^{\mathrm{sva}}_{\hat{n}}$ denotes the function realized by the solution obtained from our SVATruncation algorithm, then Theorem 11 implies the bound

$$\varepsilon^{\mathrm{opt}}_{\hat{n}} \leq \|f - f^{\mathrm{sva}}_{\hat{n}}\|_2 \leq C_f^{1/2} \, \|(s_{\hat{n}+1}, \ldots, s_n)\|_1^{1/4} . \tag{5}$$

Combining the bounds (4) and (5) above, we can conclude that the performance of our approximation $f^{\mathrm{sva}}_{\hat{n}}$ with respect to $f$ and $f^{\mathrm{opt}}_{\hat{n}}$ can be bounded as follows:

$$\|f - f^{\mathrm{opt}}_{\hat{n}}\|_2 \leq \|f - f^{\mathrm{sva}}_{\hat{n}}\|_2 \leq C_f^{1/2} \, \|f - f^{\mathrm{opt}}_{\hat{n}}\|_{H,1}^{1/4} .$$

In future work we plan to investigate the tightness of these bounds and the computational complexity of (approximately) computing $f^{\mathrm{opt}}_{\hat{n}}$.
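The two steps behind bound (5) and the final comparison can be spelled out as follows. This is a sketch that assumes Theorem 11 in the form $\|f - \hat{f}\|_2^2 \leq C_f \sqrt{s_{\hat{n}+1} + \cdots + s_n}$ and bound (4) with $p = 1$; the superscripts "sva" and "opt" denote the SVATruncation output and an optimal rank-$\hat{n}$ approximation, respectively.

```latex
% Step 1: take square roots in Theorem 11 and rewrite the sum of the tail singular values
% as an l^1 norm (all s_i are non-negative).
\|f - f^{\mathrm{sva}}_{\hat n}\|_2
  \;\le\; \Big( C_f \sqrt{s_{\hat n + 1} + \cdots + s_n} \Big)^{1/2}
  \;=\; C_f^{1/2} \, \|(s_{\hat n + 1}, \ldots, s_n)\|_1^{1/4} .

% Step 2: apply (4) with p = 1 to f' = f^{opt}, which has rank at most \hat n.
\|(s_{\hat n + 1}, \ldots, s_n)\|_1 \;\le\; \|f - f^{\mathrm{opt}}_{\hat n}\|_{H,1}
  \quad\Longrightarrow\quad
\|f - f^{\mathrm{sva}}_{\hat n}\|_2
  \;\le\; C_f^{1/2} \, \|f - f^{\mathrm{opt}}_{\hat n}\|_{H,1}^{1/4} .

% The lower bound is the definition of the optimum, since f^{sva} also has rank at most \hat n:
\|f - f^{\mathrm{opt}}_{\hat n}\|_2 \;\le\; \|f - f^{\mathrm{sva}}_{\hat n}\|_2 .
```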


5.2 Spectral Radius Assumptions 

The algorithms presented in Sections 3 and 4 assume their input is a WFA $A = \langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$ such that $A^{\otimes} = \sum_\sigma A_\sigma^{\otimes 2}$ has spectral radius $\rho(A^{\otimes}) < 1$. This condition is used in order to guarantee the existence of a closed-form expression for the summation of the series $\sum_{t \geq 0} (A^{\otimes})^t$. Algorithm ComputeSVA uses this expression for computing the Gram matrices $G_p = P_A^\top P_A$ and $G_s = S_A^\top S_A$ associated with the FB rank factorization $H_{f_A} = P_A S_A^\top$ induced by $A$. A first important remark is that since $\rho(A^{\otimes})$ is defined in terms of the eigenvalues of $A^{\otimes}$, the assumption can be tested efficiently. The rest of this section discusses the following two questions: (1) does the assumption always hold? and (2) if not, is there an alternative way to compute the Gram matrices needed by ComputeSVA?

Regarding the first question, let us start by pointing out a natural way in which one could try to prove that the assumption always holds. This approach is based on the following result due to F. Denis [?].

Proposition 13. Let $A = \langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$ be a minimal WFA realizing $f_A \in \ell^1_{\mathcal{R}}$. Then the spectral radius of $A = \sum_\sigma A_\sigma$ satisfies $\rho(A) < 1$.


In view of this, a natural question to ask is whether the fact $\rho(A) < 1$ implies $\rho(A^{\otimes}) < 1$. While this follows from the equation $\rho(M \otimes M) = \rho(M)^2$ in the case with $|\Sigma| = 1$, the result is not true in general for arbitrary matrices. In fact, obtaining interesting bounds on the spectral radius of matrices of the form $M_1 \otimes M_1 + M_2 \otimes M_2$ is an area of active research in linear algebra [?]. Following this approach would require proving new bounds along these lines that apply to matrices defining WFA for absolutely convergent rational series. An alternative approach based on Proposition 13 could be to show that the automaton computing $f_A^2$ obtained in Lemma 8 is minimal. However, this is not true in general, as witnessed by the following example. Let $A$ be the WFA over $\Sigma = \{a, b\}$ with 2 states given by:

$$\alpha_0^\top = [1 \ \ 0], \quad \alpha_\infty^\top = [1/3 \ \ 1/3], \quad A_a = \begin{bmatrix} 0 & 1/3 \\ -1/3 & 0 \end{bmatrix}, \quad A_b = \begin{bmatrix} 1/3 & 0 \\ 0 & 1/3 \end{bmatrix} .$$

Note that $\|f_A\|_1 = 1$ and therefore we have $f_A \in \ell^1_{\mathcal{R}}$. It is easy to see, by looking at the rows of $H_{f_A}$ corresponding to prefixes $\lambda$ and $a$, that rank($f_A$) $= 2$; thus, $A$ is minimal. On the other hand, one can check that $f_A(x)^2 = 3^{-2(|x|+1)}$. Thus, $f_A^2$ has rank 1 and the 4-state WFA for $f_A^2$ constructed in Lemma 8 is not minimal. In conclusion, though we have not been able to provide a counter-example to the fact that $\rho(A^{\otimes}) < 1$ when $A$ is a minimal WFA realizing a function $f_A \in \ell^2_{\mathcal{R}}$, we suspect that making progress on this problem will require a deeper understanding of the structure of absolutely convergent rational series.
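The claims about this example can be checked numerically on a finite block of the Hankel matrix. The sketch below assumes the weights are $\alpha_0 = [1, 0]$, $\alpha_\infty = [1/3, 1/3]$, $A_a = \begin{bmatrix} 0 & 1/3 \\ -1/3 & 0 \end{bmatrix}$ and $A_b = \frac{1}{3} I$, which is one reading of the example consistent with every property stated in the text; the helper names and the choice of prefixes and suffixes of length at most 2 are illustrative.

```python
import itertools
import numpy as np

# Weights of the 2-state example (assumed reading of the example in the text).
alpha0 = np.array([1.0, 0.0])
alphainf = np.array([1 / 3, 1 / 3])
A = {"a": np.array([[0.0, 1 / 3], [-1 / 3, 0.0]]), "b": np.eye(2) / 3}

def f(word):
    """f_A(x) = alpha0^T A_{x_1} ... A_{x_t} alphainf."""
    v = alpha0
    for sym in word:
        v = v @ A[sym]
    return float(v @ alphainf)

def words_up_to(max_len):
    """All words over {a, b} of length at most max_len, shortest first."""
    return ["".join(w) for t in range(max_len + 1)
            for w in itertools.product("ab", repeat=t)]

def hankel_block(g, max_len):
    """Finite Hankel block H(u, v) = g(uv) for prefixes and suffixes up to max_len."""
    ws = words_up_to(max_len)
    return np.array([[g(u + v) for v in ws] for u in ws])

# Numerical ranks of finite Hankel blocks for f_A and f_A^2.
rank_f = np.linalg.matrix_rank(hankel_block(f, 2), tol=1e-10)
rank_f_squared = np.linalg.matrix_rank(hankel_block(lambda w: f(w) ** 2, 2), tol=1e-10)

# Spectral radius of A^(x) = sum_sigma A_sigma (x) A_sigma for this example.
A_kron = sum(np.kron(M, M) for M in A.values())
rho_kron = float(np.max(np.abs(np.linalg.eigvals(A_kron))))
```

Under these assumed weights the spectral radius `rho_kron` is below 1, so this example illustrates the non-minimality of the Lemma 8 construction without violating the spectral radius assumption.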

The second question is whether it is possible to compute an SVA efficiently for a WFA such that $\rho(A^{\otimes}) \geq 1$. The key ingredient here is to provide an alternative way of computing the Gram matrices required in ComputeSVA. A first remark is that such Gram matrices are guaranteed to exist regardless of whether the assumption on the spectral radius of $A^{\otimes}$ holds or not; this follows from the proof of Lemma 9. It also follows from the same proof that each entry $G_p(i,j)$ corresponds to the inner product $\langle p_i, p_j \rangle$ between two rational functions in $\ell^2_{\mathcal{R}}$; the same holds for the entries of $G_s$. This observation suggests that, instead of computing the Gram matrices in "one shot" as in Lemma 9, it might be possible to compute each entry $G_p(i,j)$, $1 \leq i \leq j \leq n$, separately; note that the constraint on the indices exploits the fact that $G_p$ is symmetric. This can be done as follows. Recall that $A_{p,i} = \langle \alpha_0, e_i, \{A_\sigma\} \rangle$ computes $p_i$ for all $i \in [n]$. Then the function $f_{p,i,j} = p_i \cdot p_j$ is computed by the WFA $A_{p,i,j} = \langle \alpha_0^{\otimes 2}, e_i \otimes e_j, \{A_\sigma^{\otimes 2}\} \rangle$ with $n^2$ states. Now observe that by Hölder's inequality we have $f_{p,i,j} \in \ell^1_{\mathcal{R}}$. Therefore, if $\tilde{A}_{p,i,j} = \langle \tilde{\alpha}_0, \tilde{\alpha}_\infty, \{\tilde{A}_\sigma\} \rangle$ is a minimization of $A_{p,i,j}$ with rank($f_{p,i,j}$) states, then we must have $\rho(\tilde{A}) < 1$ by Proposition 13, where $\tilde{A} = \sum_\sigma \tilde{A}_\sigma$. Using the same argument as in Lemma 8 we can conclude that

$$G_p(i,j) = \langle p_i, p_j \rangle = \sum_{x \in \Sigma^*} f_{p,i,j}(x) = \tilde{\alpha}_0^\top (I - \tilde{A})^{-1} \tilde{\alpha}_\infty .$$

This gives an alternate procedure for computing $G_p$ and $G_s$ which involves $\Theta(n^2)$ WFA minimizations of automata with $n^2$ states, each of them taking time $O(n^6)$ (cf. [?]). Hence, the cost of this alternate procedure is of order $O(n^8)$, and it should only be used when it is not possible to use the $O(n^6)$ procedure given in Section 3.2.









6 Conclusions and Future Work 


With this paper we initiate a systematic study of approximate minimization problems of quantitative systems. Here we have focused our attention on weighted finite automata realizing absolutely convergent rational series. These are, of course, not all rational series but include many situations of interest, for example, all fully probabilistic automata. We have shown how the connection between rational series and infinite Hankel matrices yields powerful tools for analyzing approximate minimization problems for WFA: the singular value decomposition of infinite Hankel matrices and the singular values themselves. Our first contribution, an algorithm for computing the SVD of an infinite Hankel matrix by operating on its "compressed" representation as a WFA, uses these tools in a crucial way. Such a decomposition leads us to our second contribution: the definition of the singular value automaton (SVA) associated with a rational function $f$. SVA provide a new canonical form for WFA which is unique under the same conditions guaranteeing uniqueness of the SVD for Hankel matrices. We were also able to give an efficient algorithm for computing the SVA of a rational function $f$ from any WFA realizing $f$.

Our second set of contributions relates to the application of SVA canonical forms to the approximate minimization of WFA. The algorithm SVATruncation and the corresponding analysis presented in Theorem 11 provide novel and rigorous bounds on the quality of our approximations measured in terms of $\|f - \hat{f}\|_2$, the $\ell^2$ norm of the difference between the original and minimized functions. The importance of such bounds lies in the fact that they depend only on intrinsic quantities associated with $f$.

The present paper opens the door to many possible extensions. First and foremost, we will seek further applications and properties of the SVA canonical form for WFA. For example, a simple question that remains unanswered is to what extent the equations in Lemma 12 are enough to characterize the weights of an SVA. In the near future we are also interested in conducting a thorough empirical study by implementing the algorithms presented here. This should serve to validate our ideas and explore their possible applications to machine learning and other applied fields where WFA are used frequently. We will also set out to study the tightness of the bound in Theorem 11 in practical situations, and conjecture further refinements if necessary. It should also be possible to extend our results to other classes of systems closely related to weighted automata. In particular, we want to study approximate minimization problems for weighted tree automata and weighted context-free grammars, for which the notion of Hankel matrix can be naturally extended. Along these lines, it will be interesting to compare our approach to recent works that improve the running time of parsing algorithms by reducing the size of probabilistic context-free grammars using low-rank tensor approximations [21, 20].

Though we have not emphasized it in the present paper, this work is inspired, in part, by the general co-algebraic view of Brzozowski-style minimization [?]. We have expressed everything in very concrete matrix algebra terms because we are using the singular value decomposition in a crucial way. However, there are other minimization schemes for other types of automata coming from other dualities [13] for which we think similar approximate minimization schemes can be developed. A general abstract notion of approximate minimization is, of course, a very tempting subject to pursue and, once we have more examples, it will certainly be high on our agenda. For the moment, however, we will concentrate on concrete instances.

A Proof of Theorem 11

Unless stated otherwise, for the purpose of this appendix $\|v\|$ always denotes the Euclidean norm on vectors and $\|M\| = \sup_{\|v\| = 1} \|Mv\|$ denotes the corresponding induced norm on matrices. We shall use several properties of these norms extensively in the proof. The reader can consult [?] for a comprehensive account of these properties. Here we will just recall the following:

1. $\|MN\| \leq \|M\| \|N\|$,

2. $\|U M U^\top\| = \|M\|$ whenever $U U^\top = U^\top U = I$,

3. $\|\mathrm{diag}(M, N)\| = \max\{\|M\|, \|N\|\}$,

4. $\|M\| \leq \|M\|_F = \sqrt{\sum_{i,j} M(i,j)^2}$,

5. $\|M \otimes N\| = \|M\| \|N\|$.


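We start by noting that, without loss of generality, we can assume the automaton $A$ given as input to SVATruncation is in SVA form (in which case $A' = A$).

Now we introduce some notation by splitting the weights of $A$ into a block corresponding to states $1$ to $\hat{n}$, and another block containing states $\hat{n} + 1$ to $n$. With this, we write the following:

$$\alpha_0 = \begin{bmatrix} \alpha_0^{(1)} & \alpha_0^{(2)} \end{bmatrix}, \quad \alpha_\infty = \begin{bmatrix} \alpha_\infty^{(1)} \\ \alpha_\infty^{(2)} \end{bmatrix}, \quad A_\sigma = \begin{bmatrix} A_\sigma^{(11)} & A_\sigma^{(12)} \\ A_\sigma^{(21)} & A_\sigma^{(22)} \end{bmatrix} .$$

It is immediate to check that $\hat{A} =$ SVATruncation($A$, $\hat{n}$) is given by $\hat{\alpha}_0 = \alpha_0^{(1)}$, $\hat{\alpha}_\infty = \alpha_\infty^{(1)}$, and $\hat{A}_\sigma = A_\sigma^{(11)}$.

For simplicity of notation, we assume here and throughout the rest of the proof that the initial weights of WFA are given as row vectors.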

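Although $\hat{A}$ has $\hat{n}$ states, it will be convenient for our proof to find another WFA with $n$ states computing the same function as $\hat{A}$. We call this WFA $\tilde{A}$, and its construction is explained in the following claim, whose proof is just a simple calculation.

Claim 1. Let $\tilde{A} = \langle \tilde{\alpha}_0, \tilde{\alpha}_\infty, \{\tilde{A}_\sigma\} \rangle$ be the WFA with $n$ states given by $\tilde{\alpha}_0 = \alpha_0$,

$$\tilde{\alpha}_\infty = \begin{bmatrix} \alpha_\infty^{(1)} \\ 0 \end{bmatrix}, \quad \tilde{A}_\sigma = \begin{bmatrix} A_\sigma^{(11)} & 0 \\ 0 & A_\sigma^{(22)} \end{bmatrix} .$$

Then $\tilde{f} = f_{\tilde{A}} = \hat{f}$.

By combining $A$ and $\tilde{A}$ we can obtain a WFA computing squares of differences between $f$ and $\hat{f}$. The construction is given in the following claim, which follows from the same argument used in the proof of Lemma 8.

Claim 2. Let $B = \langle \beta_0^{\otimes 2}, \beta_\infty^{\otimes 2}, \{B_\sigma^{\otimes 2}\} \rangle$ be the WFA with $4n^2$ states constructed from

$$\beta_0 = \begin{bmatrix} \alpha_0 & -\tilde{\alpha}_0 \end{bmatrix}, \quad \beta_\infty = \begin{bmatrix} \alpha_\infty \\ \tilde{\alpha}_\infty \end{bmatrix}, \quad B_\sigma = \begin{bmatrix} A_\sigma & 0 \\ 0 & \tilde{A}_\sigma \end{bmatrix} = \mathrm{diag}(A_\sigma, \tilde{A}_\sigma) .$$

Then $f_B = (f - \hat{f})^2$.

From the weights of automaton $B$ we define the following vectors and matrices: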


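$$\begin{aligned}
\gamma_0 &= \alpha_0 \otimes \begin{bmatrix} \alpha_0 & -\tilde{\alpha}_0 \end{bmatrix} = \alpha_0 \otimes \beta_0 , \\
\gamma_\infty &= \alpha_\infty \otimes \begin{bmatrix} \alpha_\infty \\ \tilde{\alpha}_\infty \end{bmatrix} = \alpha_\infty \otimes \beta_\infty , \\
\tilde{\gamma}_\infty &= \tilde{\alpha}_\infty \otimes \begin{bmatrix} \alpha_\infty \\ \tilde{\alpha}_\infty \end{bmatrix} = \tilde{\alpha}_\infty \otimes \beta_\infty , \\
C_\sigma &= A_\sigma \otimes \begin{bmatrix} A_\sigma & 0 \\ 0 & \tilde{A}_\sigma \end{bmatrix} = A_\sigma \otimes B_\sigma , \\
\tilde{C}_\sigma &= \tilde{A}_\sigma \otimes \begin{bmatrix} A_\sigma & 0 \\ 0 & \tilde{A}_\sigma \end{bmatrix} = \tilde{A}_\sigma \otimes B_\sigma .
\end{aligned}$$

We will also write $C = \sum_\sigma C_\sigma$ and $\tilde{C} = \sum_\sigma \tilde{C}_\sigma$. This notation lets us state the following claim, which will be the starting point of our bounds. The result follows from the same calculations used in the proof of Lemma 5 and the observation that $\beta_0^{\otimes 2} = [\gamma_0 \ \ {-\gamma_0}]$.

Claim 3. For any $t \geq 0$ we have

$$\Delta_t = \sum_{x \in \Sigma^t} (f(x) - \hat{f}(x))^2 = \gamma_0 \big( C^t \gamma_\infty - \tilde{C}^t \tilde{\gamma}_\infty \big) .$$

Note that we can write the error we want to bound as $\|f - \hat{f}\|_2^2 = \sum_{t \geq 0} \Delta_t$. Our strategy will be to obtain a separate bound for each $\Delta_t$ and then sum them all together. We start by bounding $|\Delta_t|$ in terms of the norms of the matrices and vectors defined above.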


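Lemma 14. For any $t \geq 0$ the following bound holds:

$$|\Delta_t| \leq \|\gamma_0\| \|C\|^t \|\gamma_\infty - \tilde{\gamma}_\infty\| + t \|\gamma_0\| \|\tilde{\gamma}_\infty\| \max\{\|C\|, \|\tilde{C}\|\}^{t-1} \|C - \tilde{C}\| .$$

Proof. Let us start by remarking that the proof of the bound does not depend in any way on the identity of the vectors and matrices involved. Thus, we shall write $\Delta_t = \Delta_t(\gamma_\infty, \tilde{\gamma}_\infty)$, which will allow us later in the proof to change the identity of the vectors $\gamma_\infty$ and $\tilde{\gamma}_\infty$ appearing in the bound.

Now we proceed by induction on $t$. The base case $t = 0$ is immediate since

$$|\Delta_0| = |\gamma_0 (\gamma_\infty - \tilde{\gamma}_\infty)| \leq \|\gamma_0\| \|\gamma_\infty - \tilde{\gamma}_\infty\| .$$

Now assume the bound is true for $\Delta_t$. For the case $t + 1$, repeated applications of the triangle inequality yield:

$$\begin{aligned}
|\Delta_{t+1}| &= |\gamma_0 (C^{t+1} \gamma_\infty - \tilde{C}^{t+1} \tilde{\gamma}_\infty)| \\
&\leq |\gamma_0 C^{t+1} (\gamma_\infty - \tilde{\gamma}_\infty)| + |\gamma_0 (C^{t+1} - \tilde{C}^{t+1}) \tilde{\gamma}_\infty| \\
&\leq |\gamma_0 C^{t+1} (\gamma_\infty - \tilde{\gamma}_\infty)| + |\gamma_0 C^t (C - \tilde{C}) \tilde{\gamma}_\infty| + |\gamma_0 (C^t - \tilde{C}^t) \tilde{C} \tilde{\gamma}_\infty| \\
&\leq \|\gamma_0\| \|C\|^{t+1} \|\gamma_\infty - \tilde{\gamma}_\infty\| + \|\gamma_0\| \|C\|^t \|C - \tilde{C}\| \|\tilde{\gamma}_\infty\| + |\Delta_t(\tilde{C} \tilde{\gamma}_\infty, \tilde{C} \tilde{\gamma}_\infty)| . \qquad (6)
\end{aligned}$$

Let $\Delta'_t = \Delta_t(\tilde{C} \tilde{\gamma}_\infty, \tilde{C} \tilde{\gamma}_\infty)$. By our inductive hypothesis we have

$$\begin{aligned}
|\Delta'_t| &\leq \|\gamma_0\| \|C\|^t \|\tilde{C} \tilde{\gamma}_\infty - \tilde{C} \tilde{\gamma}_\infty\| + t \|\gamma_0\| \|\tilde{C} \tilde{\gamma}_\infty\| \max\{\|C\|, \|\tilde{C}\|\}^{t-1} \|C - \tilde{C}\| \\
&\leq t \|\gamma_0\| \|\tilde{\gamma}_\infty\| \max\{\|C\|, \|\tilde{C}\|\}^{t} \|C - \tilde{C}\| .
\end{aligned}$$

Finally, plugging this bound into (6) we get

$$\begin{aligned}
|\Delta_{t+1}| &\leq \|\gamma_0\| \|C\|^{t+1} \|\gamma_\infty - \tilde{\gamma}_\infty\| + \|\gamma_0\| \|C\|^t \|C - \tilde{C}\| \|\tilde{\gamma}_\infty\| + t \|\gamma_0\| \|\tilde{\gamma}_\infty\| \max\{\|C\|, \|\tilde{C}\|\}^{t} \|C - \tilde{C}\| \\
&\leq \|\gamma_0\| \|C\|^{t+1} \|\gamma_\infty - \tilde{\gamma}_\infty\| + (t + 1) \|\gamma_0\| \|\tilde{\gamma}_\infty\| \max\{\|C\|, \|\tilde{C}\|\}^{t} \|C - \tilde{C}\| .
\end{aligned}$$

□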
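Since the proof of Lemma 14 does not depend on the identity of the vectors and matrices involved, the inequality can be sanity-checked on arbitrary data. The sketch below evaluates both sides of the bound, read as $|\gamma_0 (C^t \gamma_\infty - \tilde{C}^t \tilde{\gamma}_\infty)| \leq \|\gamma_0\| \|C\|^t \|\gamma_\infty - \tilde{\gamma}_\infty\| + t \|\gamma_0\| \|\tilde{\gamma}_\infty\| \max\{\|C\|, \|\tilde{C}\|\}^{t-1} \|C - \tilde{C}\|$, for randomly drawn inputs; the function names and the random instance are illustrative assumptions and are not the structured matrices used in the appendix.

```python
import numpy as np

def delta(gamma0, C, C_tilde, g, g_tilde, t):
    """Left-hand side quantity gamma0 (C^t g - C_tilde^t g_tilde)."""
    lhs = np.linalg.matrix_power(C, t) @ g - np.linalg.matrix_power(C_tilde, t) @ g_tilde
    return float(gamma0 @ lhs)

def lemma14_bound(gamma0, C, C_tilde, g, g_tilde, t):
    """Right-hand side of the bound, using Euclidean and induced (spectral) norms."""
    norm_C = np.linalg.norm(C, 2)
    norm_C_tilde = np.linalg.norm(C_tilde, 2)
    first = np.linalg.norm(gamma0) * norm_C ** t * np.linalg.norm(g - g_tilde)
    if t == 0:
        return first                                   # the second term vanishes for t = 0
    second = (t * np.linalg.norm(gamma0) * np.linalg.norm(g_tilde)
              * max(norm_C, norm_C_tilde) ** (t - 1) * np.linalg.norm(C - C_tilde, 2))
    return first + second

# A random instance (hypothetical data, used only to exercise the inequality).
rng = np.random.default_rng(0)
dim = 5
gamma0 = rng.normal(size=dim)
g, g_tilde = rng.normal(size=dim), rng.normal(size=dim)
C, C_tilde = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
checks = [abs(delta(gamma0, C, C_tilde, g, g_tilde, t))
          <= lemma14_bound(gamma0, C, C_tilde, g, g_tilde, t) + 1e-9
          for t in range(6)]
```

A check of this kind does not replace the induction above; it only guards against transcription errors in the statement of the bound.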


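Now we proceed to bound all the terms that appear in this bound. First note that the bounds in the following claim are clear by definition.

Claim 4. We have $\|\gamma_0\| = \sqrt{2} \|\alpha_0\|^2$ and $\|\tilde{\gamma}_\infty\| \leq \sqrt{2} \|\alpha_\infty\|^2$.

The next step is to bound the term involving the norms $\|C\|$ and $\|\tilde{C}\|$. This will lead to the definition of a representation-independent parameter we call $\rho_f$.

Lemma 15. Let $\rho_f = \|\sum_\sigma A_\sigma \otimes A_\sigma\|$. Then $\rho_f$ is a positive constant depending only on $f$ which satisfies:

$$\|\tilde{C}\| \leq \|C\| = \rho_f .$$

Proof. We start by noting that $\|C\| = \max\{\|\sum_\sigma A_\sigma \otimes A_\sigma\|, \|\sum_\sigma A_\sigma \otimes \tilde{A}_\sigma\|\}$. Then we use $\|\sum_\sigma A_\sigma \otimes \tilde{A}_\sigma\| = \|\sum_\sigma \tilde{A}_\sigma \otimes A_\sigma\| = \max\{\|\sum_\sigma A_\sigma^{(11)} \otimes A_\sigma\|, \|\sum_\sigma A_\sigma^{(22)} \otimes A_\sigma\|\}$ to show that $\|C\| = \|\sum_\sigma A_\sigma \otimes A_\sigma\|$, since the rest of the terms in the maximum correspond to norms of submatrices of $\sum_\sigma A_\sigma \otimes A_\sigma$.

Now a similar argument can be used to show that

$$\begin{aligned}
\|\tilde{C}\| &= \max\Big\{ \big\|\sum_\sigma A_\sigma^{(11)} \otimes A_\sigma\big\|, \big\|\sum_\sigma A_\sigma^{(11)} \otimes \tilde{A}_\sigma\big\|, \big\|\sum_\sigma A_\sigma^{(22)} \otimes A_\sigma\big\|, \big\|\sum_\sigma A_\sigma^{(22)} \otimes \tilde{A}_\sigma\big\| \Big\} \\
&\leq \big\|\sum_\sigma A_\sigma \otimes A_\sigma\big\| .
\end{aligned}$$

Note $\rho_f$ is representation independent because it only depends on the transition weights of the SVA form of $f$, which is unique. □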


In the remainder of the proof we will assume the following holds.

Assumption 1. The SVA $A$ is such that $\rho_f = \|\sum_\sigma A_\sigma^{\otimes 2}\| < 1$.

This assumption is not essential, and is only introduced to simplify the calculations involved in the proof. See the remarks at the end of this appendix for a discussion on how to remove the assumption.

In order to bound $\|\gamma_\infty - \tilde{\gamma}_\infty\|$ and $\|C - \tilde{C}\|$ we will make extensive use of the properties of SVA given in Lemma 12. This will allow us to obtain bounds that only depend on the Hankel singular values of $f$, which are intrinsic representation-independent quantities associated with $f$.

Lemma 16.


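$$\|\gamma_\infty - \tilde{\gamma}_\infty\| \leq \sqrt{2} \, \|\alpha_\infty\| \sqrt{s_{\hat{n}+1} + \cdots + s_n} .$$

Proof. We start by unwinding the definitions of $\gamma_\infty$ and $\tilde{\gamma}_\infty$ to obtain the bound:

$$\begin{aligned}
\|\gamma_\infty - \tilde{\gamma}_\infty\| &= \|(\alpha_\infty - \tilde{\alpha}_\infty) \otimes \beta_\infty\| \\
&= \|\alpha_\infty - \tilde{\alpha}_\infty\| \|\beta_\infty\| \\
&= \|\alpha_\infty^{(2)}\| \sqrt{\|\alpha_\infty\|^2 + \|\tilde{\alpha}_\infty\|^2} \\
&\leq \sqrt{2} \, \|\alpha_\infty\| \|\alpha_\infty^{(2)}\| ,
\end{aligned}$$

where we used that $\|\tilde{\alpha}_\infty\| \leq \|\alpha_\infty\|$. Now note that Lemma 12 yields the crude estimate $\alpha_\infty(i)^2 \leq s_i$ for all $i \in [n]$. We use this last observation to obtain the following bound and complete the proof:

$$\|\alpha_\infty^{(2)}\| = \sqrt{\sum_{i=\hat{n}+1}^{n} \alpha_\infty(i)^2} \leq \sqrt{s_{\hat{n}+1} + \cdots + s_n} .$$

□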


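Lemma 17.

$$\|C - \tilde{C}\| \leq \Big( \sum_\sigma \|A_\sigma\|^2 \Big)^{1/2} \sqrt{\frac{s_{\hat{n}+1} + \cdots + s_n}{s_{\hat{n}}}} .$$

Proof. Let $\Gamma = C - \tilde{C} = \sum_\sigma (A_\sigma - \tilde{A}_\sigma) \otimes B_\sigma$. By expanding $A_\sigma - \tilde{A}_\sigma$ in this expression one can see that

$$\Gamma = \begin{bmatrix} 0 & \Gamma^{(12)} \\ \Gamma^{(21)} & 0 \end{bmatrix} ,$$

where $\Gamma^{(ij)} = \sum_\sigma A_\sigma^{(ij)} \otimes B_\sigma$ for $ij \in \{12, 21\}$. Since both $\Gamma^{(12)}$ and $\Gamma^{(21)}$ are unitarily equivalent to block-diagonal matrices, $\Gamma$ is also unitarily equivalent to a block-diagonal matrix. Thus, we have

$$\begin{aligned}
\|\Gamma\| &= \max\Big\{ \big\|\sum_\sigma A_\sigma^{(12)} \otimes A_\sigma\big\|, \big\|\sum_\sigma A_\sigma^{(12)} \otimes A_\sigma^{(11)}\big\|, \big\|\sum_\sigma A_\sigma^{(12)} \otimes A_\sigma^{(22)}\big\|, \\
&\qquad\quad \big\|\sum_\sigma A_\sigma^{(21)} \otimes A_\sigma\big\|, \big\|\sum_\sigma A_\sigma^{(21)} \otimes A_\sigma^{(11)}\big\|, \big\|\sum_\sigma A_\sigma^{(21)} \otimes A_\sigma^{(22)}\big\| \Big\} \\
&= \max\Big\{ \big\|\sum_\sigma A_\sigma^{(12)} \otimes A_\sigma\big\|, \big\|\sum_\sigma A_\sigma^{(21)} \otimes A_\sigma\big\| \Big\} ,
\end{aligned}$$

where the last equation follows because we just removed from the maximum terms corresponding to norms of submatrices of the remaining terms. Now we can use Lemma 12 to bound the terms in this expression (we only give one of the bounds in detail; the other one follows from an analogous argument). Let us start with the following simple observation:

$$\big\|\sum_\sigma A_\sigma^{(12)} \otimes A_\sigma\big\| \leq \sum_\sigma \|A_\sigma^{(12)}\| \|A_\sigma\| \leq \sqrt{\sum_\sigma \|A_\sigma^{(12)}\|^2} \sqrt{\sum_\sigma \|A_\sigma\|^2} .$$

The following chain of inequalities provides the final bound:

$$\begin{aligned}
\sum_\sigma \|A_\sigma^{(12)}\|^2 &\leq \sum_\sigma \|A_\sigma^{(12)}\|_F^2 \\
&= \sum_\sigma \sum_{i=1}^{\hat{n}} \sum_{j=\hat{n}+1}^{n} A_\sigma(i,j)^2 \\
&= \sum_{j=\hat{n}+1}^{n} \sum_{i=1}^{\hat{n}} \frac{s_i}{s_i} \sum_\sigma A_\sigma(i,j)^2 \\
&\leq \frac{1}{s_{\hat{n}}} \sum_{j=\hat{n}+1}^{n} \sum_{i=1}^{\hat{n}} s_i \sum_\sigma A_\sigma(i,j)^2 \\
&\leq \frac{1}{s_{\hat{n}}} \sum_{j=\hat{n}+1}^{n} \big( s_j - \alpha_0(j)^2 \big) \\
&\leq \frac{s_{\hat{n}+1} + \cdots + s_n}{s_{\hat{n}}} .
\end{aligned}$$

□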


Now we can put all the above together in order to obtain the bound in Theorem 11. We start with the following bound for $\Delta_t$ for some fixed $t$, which follows from simple algebraic manipulations:

$$|\Delta_t| \leq \Big( C'_1 \rho_f^t + C'_2 \frac{t \rho_f^{t-1}}{\sqrt{s_{\hat{n}}}} \Big) \sqrt{s_{\hat{n}+1} + \cdots + s_n} ,$$

where $C'_1 = 2 \|\alpha_0\|^2 \|\alpha_\infty\|$ and $C'_2 = 2 \|\alpha_0\|^2 \|\alpha_\infty\|^2 \big( \sum_\sigma \|A_\sigma\|^2 \big)^{1/2}$ are constants depending only on $f$.

Finally, using that $\|f - \hat{f}\|_2^2 = \sum_{t \geq 0} \Delta_t$, we can sum the bounds on $\Delta_t$ and obtain

$$\begin{aligned}
\|f - \hat{f}\|_2^2 &\leq \Big( C_1 + \frac{C_2}{\sqrt{s_{\hat{n}}}} \Big) \sqrt{s_{\hat{n}+1} + \cdots + s_n} \\
&\leq C_f \sqrt{s_{\hat{n}+1} + \cdots + s_n} ,
\end{aligned}$$

with $C_1 = C'_1 / (1 - \rho_f)$, $C_2 = C'_2 / (1 - \rho_f)^2$, and $C_f = C_1 + C_2 / \sqrt{s_n}$. This concludes the proof of Theorem 11.

Note that Assumption 1 played a crucial role in asserting the convergence of $\sum_{t \geq 0} \rho_f^t$. Although the assumption may not hold in general, it can be shown using the results from [31, Section 2.3.4] that there exists a minimal WFA $D$ with $f_D = f^2$ such that $\|\sum_\sigma D_\sigma\| < 1$. Thus, by Theorem 1 we have $D_\sigma = Q^{-1} A_\sigma^{\otimes 2} Q$ for some $Q$. It is then possible to rework the proof of Lemma 14 using $D$ and obtain a similar bound involving $\|\sum_\sigma D_\sigma\|$ instead of $\|C\|$ and $\|\tilde{C}\|$. The details of this approach will be developed in a longer version of this paper, but it suffices to say here that in the end one obtains the same bound given in Theorem 11 with a different constant $C'_f$.